I. Installation
elasticsearch_exporter is completely separate from the ES cluster: it requires no changes to the existing ES clusters (there may be many of them) and no restarts, only network access to the cluster, so it can be deployed on its own server. Project page: elasticsearch_exporter
[root@prod-es-master01 ~]# mkdir /opt/soft/
[root@prod-es-master01 ~]# cd /opt/soft/
[root@prod-es-master01 soft]# wget https://github.com/prometheus-community/elasticsearch_exporter/releases/download/v1.3.0/elasticsearch_exporter-1.3.0.linux-amd64.tar.gz
[root@prod-es-master01 soft]# tar -zxvf elasticsearch_exporter-1.3.0.linux-amd64.tar.gz -C /usr/local/
[root@prod-es-master01 soft]# cd /usr/local/
[root@prod-es-master01 local]# mv elasticsearch_exporter-1.3.0.linux-amd64 elasticsearch_exporter
II. Starting the exporter
1. Common flags
Flag | Description |
---|---|
--es.uri | Address (host and port) of the Elasticsearch node to connect to; defaults to http://localhost:9200 |
--es.all | Defaults to false; if true, query stats for all nodes in the cluster, not only the node we connect to |
--es.cluster_settings | Defaults to false; if true, include cluster settings in the stats |
--es.indices | Defaults to false; if true, query stats for all indices in the cluster |
--es.indices_settings | Defaults to false; if true, query settings stats for all indices in the cluster |
--es.shards | Defaults to false; if true, query stats for all indices in the cluster, including shard-level stats |
--es.snapshots | Defaults to false; if true, query stats for the cluster's snapshots |
2. Managing the exporter with systemd
vim /lib/systemd/system/es_exporter.service
[Unit]
Description=The es_exporter
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/elasticsearch_exporter/elasticsearch_exporter --es.all --es.indices --es.cluster_settings --es.indices_settings --es.shards --es.snapshots --es.uri http://user:passwd@ip:9200
Restart=on-failure
[Install]
WantedBy=multi-user.target
Note: if the ES cluster has x-pack authentication enabled, the username and password are required in the URI; otherwise they can be omitted.
3. Starting the service
systemctl daemon-reload
systemctl start es_exporter
systemctl enable es_exporter
4. Checking the data
Now curl port 9114 to verify that metrics are being exposed:
[root@prod-es-master01 ~]# curl 127.0.0.1:9114/metrics
III. Configuring the scrape job
1. Configuring Prometheus
~]# vim /usr/local/prometheus/prometheus.yml
## elasticsearch Cluster
- job_name: "elasticsearch Cluster"
  scrape_interval: 60s
  scrape_timeout: 60s
  static_configs:
    - targets: ['192.168.66.86:9114']
  relabel_configs:
    - source_labels: [__address__]
      regex: '(.*)\:9114'
      target_label: 'nodeip'
      replacement: '$1'
    - source_labels: [__address__]
      regex: '(.*)\:9114'
      target_label: 'hostname'
      replacement: '$1'
# Validate the configuration and reload Prometheus
~]# promtool check config /usr/local/prometheus/prometheus.yml
~]# curl -X POST http://127.0.0.1:9090/-/reload
2. Adding Grafana dashboards
Dashboard IDs: 13071, 13072, 13073, 13074, 2322
IV. Creating alerting rules
groups:
- name: es-cluster-exporter.rules
  rules:
  - alert: ElasticsearchHeapUsageTooHigh
    expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Elasticsearch heap usage too high (instance {{ $labels.instance }})
      description: "Elasticsearch heap usage is above 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: ElasticsearchHeapUsageWarning
    expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Elasticsearch heap usage warning (instance {{ $labels.instance }})
      description: "Elasticsearch heap usage is above 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: ElasticsearchDiskOutOfSpace
    expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 10
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Elasticsearch disk out of space (instance {{ $labels.instance }})
      description: "Disk usage is above 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: ElasticsearchDiskSpaceLow
    expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 20
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Elasticsearch disk space low (instance {{ $labels.instance }})
      description: "Disk usage is above 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: ElasticsearchClusterRed
    expr: elasticsearch_cluster_health_status{color="red"} == 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Elasticsearch cluster status red (instance {{ $labels.instance }})
      description: "Elasticsearch cluster is in red status\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: ElasticsearchClusterYellow
    expr: elasticsearch_cluster_health_status{color="yellow"} == 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Elasticsearch cluster status yellow (instance {{ $labels.instance }})
      description: "Elasticsearch cluster is in yellow status\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: ElasticsearchHealthyNodes
    expr: elasticsearch_cluster_health_number_of_nodes < 3
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Elasticsearch healthy nodes (instance {{ $labels.instance }})
      description: "Fewer than 3 healthy Elasticsearch nodes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: ElasticsearchHealthyDataNodes
    expr: elasticsearch_cluster_health_number_of_data_nodes < 2
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Elasticsearch healthy data nodes (instance {{ $labels.instance }})
      description: "Fewer than 2 healthy Elasticsearch data nodes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: ElasticsearchRelocatingShards
    expr: elasticsearch_cluster_health_relocating_shards > 0
    for: 0m
    labels:
      severity: info
    annotations:
      summary: Elasticsearch relocating shards (instance {{ $labels.instance }})
      description: "Elasticsearch is relocating shards\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: ElasticsearchRelocatingShardsTooLong
    expr: elasticsearch_cluster_health_relocating_shards > 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Elasticsearch relocating shards too long (instance {{ $labels.instance }})
      description: "Elasticsearch has been relocating shards for more than 15 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: ElasticsearchInitializingShards
    expr: elasticsearch_cluster_health_initializing_shards > 0
    for: 0m
    labels:
      severity: info
    annotations:
      summary: Elasticsearch initializing shards (instance {{ $labels.instance }})
      description: "Elasticsearch is initializing shards\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: ElasticsearchInitializingShardsTooLong
    expr: elasticsearch_cluster_health_initializing_shards > 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Elasticsearch initializing shards too long (instance {{ $labels.instance }})
      description: "Elasticsearch has been initializing shards for more than 15 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: ElasticsearchUnassignedShards
    expr: elasticsearch_cluster_health_unassigned_shards > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Elasticsearch unassigned shards (instance {{ $labels.instance }})
      description: "Elasticsearch has unassigned shards\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: ElasticsearchPendingTasks
    expr: elasticsearch_cluster_health_number_of_pending_tasks > 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Elasticsearch pending tasks (instance {{ $labels.instance }})
      description: "Elasticsearch has pending tasks; cluster metadata changes are lagging behind\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
V. Key monitoring metrics
The key areas to watch are: search and indexing performance, memory allocation and garbage collection, host-level system and network metrics, cluster health and node availability, and resource saturation and related errors.
1. Cluster health and node availability
The cluster health API reports the overall health of the cluster; treat it as a key signal that the cluster is running smoothly, and investigate whenever the status changes. The most important fields it returns, and the corresponding Prometheus metrics, are listed below.
Field | Notes | metric name |
---|---|---|
status | Cluster status: green (all primary and replica shards are allocated), yellow (all primaries are allocated but not all replicas), red (some primary shards are not allocated) | elasticsearch_cluster_health_status |
number_of_nodes / number_of_data_nodes | Number of nodes / data nodes in the cluster | elasticsearch_cluster_health_number_of_nodes / elasticsearch_cluster_health_number_of_data_nodes |
active_primary_shards | Total number of active primary shards | elasticsearch_cluster_health_active_primary_shards |
active_shards | Total number of active shards (including replicas) | elasticsearch_cluster_health_active_shards |
relocating_shards | Number of shards currently being moved to other nodes; normally 0, it rises when nodes join or leave the cluster | elasticsearch_cluster_health_relocating_shards |
initializing_shards | Number of shards being initialized | elasticsearch_cluster_health_initializing_shards |
unassigned_shards | Number of unassigned shards; normally 0, it rises when replica shards are lost | elasticsearch_cluster_health_unassigned_shards |
number_of_pending_tasks | Only the master node can apply cluster-level metadata changes (creating indices, updating mappings, allocating shards, etc.); the pending-tasks API shows the tasks waiting in this queue, which should almost always be empty | elasticsearch_cluster_health_number_of_pending_tasks |
Based on the metrics above, configure a Singlestat panel for the cluster status so the health state can be seen at a glance; a sketch of the underlying query follows below.
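A minimal sketch of a recording rule that maps the status to a single number suitable for such a panel; the record name es:cluster_status:code is an assumption, not something the exporter provides:

groups:
- name: es-cluster-derived.rules
  rules:
  # 0 = green, 1 = yellow, 2 = red. elasticsearch_cluster_health_status exposes one
  # series per color with value 1 for the currently active status, so filter on == 1
  # and map each color to a number.
  - record: es:cluster_status:code
    expr: |
      (elasticsearch_cluster_health_status{color="red"} == 1) * 2
        or (elasticsearch_cluster_health_status{color="yellow"} == 1)
        or (elasticsearch_cluster_health_status{color="green"} == 1) * 0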
2. Host-level system and network metrics
metric name | description |
---|---|
elasticsearch_process_cpu_percent | Percent CPU used by the ES process |
elasticsearch_filesystem_data_free_bytes | Free space on the block device, in bytes |
elasticsearch_process_open_files_count | Open file descriptors held by the ES process |
elasticsearch_transport_rx_packets_total | Count of packets received (inter-node network traffic in) |
elasticsearch_transport_tx_packets_total | Count of packets sent (inter-node network traffic out) |
3. JVM memory and garbage collection
metric name | description |
---|---|
elasticsearch_jvm_gc_collection_seconds_count | Count of JVM GC runs |
elasticsearch_jvm_gc_collection_seconds_sum | GC run time in seconds |
elasticsearch_jvm_memory_committed_bytes | JVM memory currently committed, by area |
elasticsearch_jvm_memory_used_bytes | JVM memory currently used, by area |
Focus on the memory used by the JVM heap and on the share of time spent in JVM GC to decide whether there is a GC problem. Elasticsearch relies on garbage collection to free heap memory and by default starts collecting when heap usage reaches 75%. Alerting on heap usage lets you judge whether garbage is being collected as fast as it is produced; if not, increase the heap size or add nodes. A sketch of the GC time ratio follows below.
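A minimal sketch of recording rules for the GC time ratio mentioned above; the record names are illustrative:

groups:
- name: es-jvm-derived.rules
  rules:
  # Fraction of wall-clock time each node spent in JVM GC over the last 5 minutes.
  - record: es:jvm_gc_time_ratio:rate5m
    expr: rate(elasticsearch_jvm_gc_collection_seconds_sum[5m])
  # Average duration of a single GC run over the last 5 minutes.
  - record: es:jvm_gc_avg_duration_seconds:rate5m
    expr: rate(elasticsearch_jvm_gc_collection_seconds_sum[5m]) / rate(elasticsearch_jvm_gc_collection_seconds_count[5m])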
4. Search and indexing performance
Search requests
metric name | description |
---|---|
elasticsearch_indices_search_query_total | Total number of queries |
elasticsearch_indices_search_query_time_seconds | Total query time in seconds |
elasticsearch_indices_search_fetch_total | Total number of fetches |
elasticsearch_indices_search_fetch_time_seconds | Total fetch time in seconds |
Indexing requests
metric name | description |
---|---|
elasticsearch_indices_indexing_index_total | Total index calls |
elasticsearch_indices_indexing_index_time_seconds_total | Cumulative index time in seconds |
elasticsearch_indices_refresh_total | Total refreshes |
elasticsearch_indices_refresh_time_seconds_total | Total time spent refreshing, in seconds |
elasticsearch_indices_flush_total | Total flushes |
elasticsearch_indices_flush_time_seconds | Cumulative flush time in seconds |
Plot the time and the operation count on the same graph, with time on the left y-axis and the corresponding operation count on the right y-axis; time divided by operations gives the average latency per operation and makes abnormal performance easy to spot. If the computed average indexing latency keeps rising, you may be bulk-indexing too many documents at once. Elasticsearch persists data to disk with flush operations; if flush latency keeps rising, disk I/O is likely the bottleneck, and if it continues the cluster will eventually be unable to index data. A sketch of the latency calculations follows below.
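As a minimal sketch of the time/ops ratio described above (record names are illustrative), the average per-operation latencies can be precomputed as recording rules:

groups:
- name: es-performance-derived.rules
  rules:
  # Average query latency: time spent on queries divided by the number of queries.
  - record: es:search_query_latency_seconds:rate5m
    expr: rate(elasticsearch_indices_search_query_time_seconds[5m]) / rate(elasticsearch_indices_search_query_total[5m])
  # Average indexing latency per index call.
  - record: es:indexing_latency_seconds:rate5m
    expr: rate(elasticsearch_indices_indexing_index_time_seconds_total[5m]) / rate(elasticsearch_indices_indexing_index_total[5m])
  # Average flush latency; a steadily rising value usually points at disk I/O pressure.
  - record: es:flush_latency_seconds:rate5m
    expr: rate(elasticsearch_indices_flush_time_seconds[5m]) / rate(elasticsearch_indices_flush_total[5m])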
5. Resource saturation
metric name | description |
---|---|
elasticsearch_thread_pool_queue_count | Thread pool operations queued |
elasticsearch_thread_pool_rejected_count | Thread pool operations rejected |
elasticsearch_indices_fielddata_memory_size_bytes | Field data cache memory usage in bytes |
elasticsearch_indices_fielddata_evictions | Evictions from the field data cache |
elasticsearch_indices_filter_cache_memory_size_bytes | Filter cache memory usage in bytes |
elasticsearch_indices_filter_cache_evictions | Evictions from the filter cache |
elasticsearch_cluster_health_number_of_pending_tasks | Cluster-level changes which have not yet been executed |
elasticsearch_indices_get_missing_total | Total GET requests for missing documents |
elasticsearch_indices_get_missing_time_seconds | Total time spent on GET requests for missing documents, in seconds |
Build dashboards from the metrics above. Elasticsearch nodes use thread pools to manage how threads consume memory and CPU, so the number of queued and rejected requests shows whether a node has enough capacity. Every Elasticsearch node maintains many types of thread pools; generally the most important ones are search, index, merge, and bulk. The queue size of each pool is the number of requests on that node waiting to be served; once a pool reaches its maximum queue size (the default differs per pool type), further requests are rejected. A sketch of a rejection alert follows below.
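A minimal sketch of monitoring thread pool rejections, to be appended to the rule file above; the record and alert names are illustrative, and the pool-name label is assumed here to be type:

groups:
- name: es-saturation-derived.rules
  rules:
  # Rate of thread pool rejections per node and pool; anything persistently above
  # zero means the pool's queue is full and requests are being dropped.
  - record: es:thread_pool_rejections:rate5m
    expr: rate(elasticsearch_thread_pool_rejected_count[5m])
  - alert: ElasticsearchThreadPoolRejections
    expr: rate(elasticsearch_thread_pool_rejected_count[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Elasticsearch thread pool rejections (instance {{ $labels.instance }})
      description: "Thread pool operations are being rejected\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"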