Monitoring Elasticsearch with Prometheus

I. Installation

elasticsearch_exporter runs independently of the ES cluster: it requires no changes to the existing cluster(s) and no restarts, only network access to the cluster, so it can be deployed on a standalone server. Project page: elasticsearch_exporter
[root@prod-es-master01 ~]# mkdir /opt/soft/
[root@prod-es-master01 ~]# cd /opt/soft/
[root@prod-es-master01 soft]# wget https://github.com/prometheus-community/elasticsearch_exporter/releases/download/v1.3.0/elasticsearch_exporter-1.3.0.linux-amd64.tar.gz
[root@prod-es-master01 soft]# tar -zxvf elasticsearch_exporter-1.3.0.linux-amd64.tar.gz -C /usr/local/
[root@prod-es-master01 soft]# cd /usr/local/
[root@prod-es-master01 local]# mv elasticsearch_exporter-1.3.0.linux-amd64 elasticsearch_exporter

II. Starting the exporter

1. Common options

Option                   Description
--es.uri                 Address (host and port) of the Elasticsearch node to connect to; defaults to http://localhost:9200
--es.all                 Defaults to false; if true, query stats for all nodes in the cluster, not just the node connected to
--es.cluster_settings    Defaults to false; if true, include cluster settings in the stats
--es.indices             Defaults to false; if true, query stats for all indices in the cluster
--es.indices_settings    Defaults to false; if true, query settings stats for all indices in the cluster
--es.shards              Defaults to false; if true, query stats for all indices, including shard-level stats
--es.snapshots           Defaults to false; if true, query stats for cluster snapshots

2. Managing with systemd

vim /lib/systemd/system/es_exporter.service

[Unit]
Description=The es_exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/elasticsearch_exporter/elasticsearch_exporter --es.all --es.indices --es.cluster_settings --es.indices_settings --es.shards --es.snapshots  --es.uri http://user:passwd@ip:9200
Restart=on-failure

[Install]
WantedBy=multi-user.target
Note: if the ES cluster has X-Pack authentication enabled, the URI must include a username and password; otherwise they can be omitted.

3. Starting the service

systemctl daemon-reload
systemctl start es_exporter
systemctl enable es_exporter

4. Verifying the data
Now curl port 9114 to verify that metrics are being exposed:

[root@prod-es-master01 ~]# curl 127.0.0.1:9114/metrics
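The exporter serves the standard Prometheus text exposition format. As a minimal sketch of what that output looks like and how it can be parsed, here is a short Python snippet; the sample lines are illustrative, not captured from a real cluster:

```python
# Sketch: parse a few lines of Prometheus text-format output, as returned
# by `curl 127.0.0.1:9114/metrics`. The sample payload below is made up.
sample = """\
# HELP elasticsearch_cluster_health_status Whether all primary and replica shards are allocated.
# TYPE elasticsearch_cluster_health_status gauge
elasticsearch_cluster_health_status{cluster="prod",color="green"} 1
elasticsearch_cluster_health_status{cluster="prod",color="red"} 0
"""

def parse_metrics(text):
    """Return a list of (name, labels, value) from exposition-format lines."""
    out = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        metric, value = line.rsplit(" ", 1)
        if "{" in metric:
            name, labels = metric.split("{", 1)
            labels = labels.rstrip("}")
        else:
            name, labels = metric, ""
        out.append((name, labels, float(value)))
    return out

for name, labels, value in parse_metrics(sample):
    print(name, labels, value)
```

In practice Prometheus does this parsing itself; the snippet is only meant to show the shape of the data the scrape job below will collect.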

III. Configuring the scrape job

1. Configure the Prometheus scrape job

~]# vim /usr/local/prometheus/prometheus.yml

  ## elasticsearch Cluster
  - job_name: "elasticsearch Cluster"
    scrape_interval: 60s
    scrape_timeout: 60s
    static_configs:
    - targets: ['192.168.66.86:9114']
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*)\:9114'
        target_label: 'nodeip'
        replacement: '$1'
      - source_labels: [__address__]
        regex: '(.*)\:9114'
        target_label: 'hostname'
        replacement: '$1'

# Validate and reload the Prometheus configuration
~]# promtool check config /usr/local/prometheus/prometheus.yml
~]# curl -X POST http://127.0.0.1:9090/-/reload

2. Add Grafana dashboards
Dashboard IDs: 13071, 13072, 13073, 13074, 2322

IV. Creating alert rules

groups:
  - name: es-cluster-exporter.rules
    rules:
    - alert: ElasticsearchHeapUsageTooHigh
      expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: Elasticsearch heap usage too high (instance {{ $labels.instance }})
        description: "Elasticsearch heap usage is above 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchHeapUsageWarning
      expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 80
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Elasticsearch heap usage warning (instance {{ $labels.instance }})
        description: "Elasticsearch heap usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchDiskOutOfSpace
      expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 10
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Elasticsearch disk out of space (instance {{ $labels.instance }})
        description: "Disk usage is over 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchDiskSpaceLow
      expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 20
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Elasticsearch disk space low (instance {{ $labels.instance }})
        description: "Disk usage is over 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchClusterRed
      expr: elasticsearch_cluster_health_status{color="red"} == 1
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Elasticsearch cluster status red (instance {{ $labels.instance }})
        description: "Elastic cluster is in red status\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchClusterYellow
      expr: elasticsearch_cluster_health_status{color="yellow"} == 1
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: Elasticsearch cluster status yellow (instance {{ $labels.instance }})
        description: "Elastic cluster is in yellow status\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchHealthyNodes
      expr: elasticsearch_cluster_health_number_of_nodes < 3
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Elasticsearch healthy nodes (instance {{ $labels.instance }})
        description: "Fewer than 3 healthy nodes in the cluster\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchHealthyDataNodes
      expr: elasticsearch_cluster_health_number_of_data_nodes < 2
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Elasticsearch healthy data nodes (instance {{ $labels.instance }})
        description: "Fewer than 2 healthy data nodes in the cluster\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchRelocatingShards
      expr: elasticsearch_cluster_health_relocating_shards > 0
      for: 0m
      labels:
        severity: info
      annotations:
        summary: Elasticsearch relocating shards (instance {{ $labels.instance }})
        description: "Elasticsearch is relocating shards\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchRelocatingShardsTooLong
      expr: elasticsearch_cluster_health_relocating_shards > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: Elasticsearch relocating shards too long (instance {{ $labels.instance }})
        description: "Elasticsearch has been relocating shards for more than 15 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchInitializingShards
      expr: elasticsearch_cluster_health_initializing_shards > 0
      for: 0m
      labels:
        severity: info
      annotations:
        summary: Elasticsearch initializing shards (instance {{ $labels.instance }})
        description: "Elasticsearch is initializing shards\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchInitializingShardsTooLong
      expr: elasticsearch_cluster_health_initializing_shards > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: Elasticsearch initializing shards too long (instance {{ $labels.instance }})
        description: "Elasticsearch has been initializing shards for more than 15 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchUnassignedShards
      expr: elasticsearch_cluster_health_unassigned_shards > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Elasticsearch unassigned shards (instance {{ $labels.instance }})
        description: "Elasticsearch has unassigned shards\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchPendingTasks
      expr: elasticsearch_cluster_health_number_of_pending_tasks > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: Elasticsearch pending tasks (instance {{ $labels.instance }})
        description: "Elasticsearch has pending tasks; the cluster is lagging behind\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
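The two heap rules above divide used heap bytes by max heap bytes and compare against 80% and 90%. As a quick sanity check of that arithmetic, here is a small Python sketch with made-up sample values:

```python
# Sanity check of the heap-usage alert arithmetic:
# (used / max) * 100 crossing the 80% / 90% thresholds.
def heap_severity(used_bytes, max_bytes):
    """Mirror the thresholds of the two heap alert rules."""
    pct = used_bytes / max_bytes * 100
    if pct > 90:
        return "critical"   # ElasticsearchHeapUsageTooHigh
    if pct > 80:
        return "warning"    # ElasticsearchHeapUsageWarning
    return "ok"

# Made-up example: ~29 GiB used of a 31 GiB heap (~93.5%)
print(heap_severity(29 * 2**30, 31 * 2**30))
```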

V. Core monitoring metrics

The key areas to watch:
- Query and indexing performance
- Memory allocation and garbage collection
- Host-level system and network metrics
- Cluster health and node availability
- Resource saturation and related errors

1. Cluster health and node availability
The cluster health API reports the overall health of the cluster. Treat the status as a key signal that the cluster is running smoothly, and investigate whenever it changes. The important fields it returns, and the corresponding Prometheus metrics:

Field                                  Notes                                                                 Metric name
status                                 Cluster status: green (all primary and replica shards allocated), yellow (all primaries allocated but not all replicas), red (some primaries unallocated)    elasticsearch_cluster_health_status
number_of_nodes / number_of_data_nodes Number of nodes / data nodes in the cluster                           elasticsearch_cluster_health_number_of_nodes / elasticsearch_cluster_health_number_of_data_nodes
active_primary_shards                  Total active primary shards                                           elasticsearch_cluster_health_active_primary_shards
active_shards                          Total active shards, including replicas                               elasticsearch_cluster_health_active_shards
relocating_shards                      Shards currently migrating to other nodes; usually 0, rises when nodes join or leave the cluster    elasticsearch_cluster_health_relocating_shards
initializing_shards                    Shards being initialized                                              elasticsearch_cluster_health_initializing_shards
unassigned_shards                      Unassigned shards; usually 0, rises when replica shards are lost      elasticsearch_cluster_health_unassigned_shards
number_of_pending_tasks                Only the master node can apply cluster-level metadata changes (creating indices, updating mappings, allocating shards, etc.); the pending-tasks API shows the queued tasks, and in the vast majority of cases this queue stays at zero    elasticsearch_cluster_health_number_of_pending_tasks
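The same fields can also be read straight from the `_cluster/health` API response. A self-contained sketch with a hardcoded sample payload (all values are illustrative; in practice you would fetch it with `curl http://localhost:9200/_cluster/health`):

```python
import json

# Illustrative _cluster/health response; values are made up.
payload = json.loads("""{
  "cluster_name": "prod",
  "status": "yellow",
  "number_of_nodes": 3,
  "number_of_data_nodes": 2,
  "active_primary_shards": 10,
  "active_shards": 18,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 2,
  "number_of_pending_tasks": 0
}""")

# yellow + unassigned shards usually means replica shards are missing
if payload["status"] != "green":
    print(f'cluster {payload["cluster_name"]} is {payload["status"]}, '
          f'{payload["unassigned_shards"]} unassigned shard(s)')
```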

Based on the metrics above, configure a cluster-status Singlestat panel so the health state is visible at a glance.

2. Host-level system and network metrics

Metric name                                 Description
elasticsearch_process_cpu_percent           Percent CPU used by the ES process
elasticsearch_filesystem_data_free_bytes    Free space on the block device, in bytes
elasticsearch_process_open_files_count      Open file descriptors of the ES process
elasticsearch_transport_rx_packets_total    Packets received (inter-node traffic in)
elasticsearch_transport_tx_packets_total    Packets sent (inter-node traffic out)

3. JVM memory and garbage collection

Metric name                                     Description
elasticsearch_jvm_gc_collection_seconds_count   Count of JVM GC runs
elasticsearch_jvm_gc_collection_seconds_sum     GC run time in seconds
elasticsearch_jvm_memory_committed_bytes        JVM memory currently committed, by area
elasticsearch_jvm_memory_used_bytes             JVM memory currently used, by area

Focus on JVM heap usage and the share of time spent in GC to spot garbage collection problems. Elasticsearch relies on GC to free heap memory and by default triggers a collection when heap usage reaches 75%. Alerting on heap usage tells you whether collection is keeping up with allocation; if it is not, increase the heap size or add nodes.
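One way to express "share of wall-clock time spent in GC" with the counters above is a rate over the seconds counter. A PromQL sketch; the 5m window and the average-pause formula are assumptions to adapt to your scrape interval:

```
# Fraction of time spent in JVM GC per node over the last 5 minutes
rate(elasticsearch_jvm_gc_collection_seconds_sum[5m])

# Average GC pause length: GC seconds per GC run
rate(elasticsearch_jvm_gc_collection_seconds_sum[5m])
  / rate(elasticsearch_jvm_gc_collection_seconds_count[5m])
```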

4. Search and indexing performance

Search requests:
Metric name                                       Description
elasticsearch_indices_search_query_total          Total query count
elasticsearch_indices_search_query_time_seconds   Cumulative query time in seconds
elasticsearch_indices_search_fetch_total          Total fetch count
elasticsearch_indices_search_fetch_time_seconds   Cumulative fetch time in seconds

Indexing requests:
Metric name                                               Description
elasticsearch_indices_indexing_index_total                Total index calls
elasticsearch_indices_indexing_index_time_seconds_total   Cumulative index time in seconds
elasticsearch_indices_refresh_total                       Total refreshes
elasticsearch_indices_refresh_time_seconds_total          Total time spent refreshing, in seconds
elasticsearch_indices_flush_total                         Total flushes
elasticsearch_indices_flush_time_seconds                  Cumulative flush time in seconds

Plot the time and operation-count series on one graph (time on the left y-axis, count on the right) and divide time by ops to get the average per-operation latency and judge whether performance is abnormal. A steadily rising average indexing latency may mean too many documents are being sent per bulk request. Elasticsearch persists data to disk via flush; if flush latency keeps rising, the disks may lack I/O capacity, and if it continues the node will eventually be unable to index data.
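The "ops/time" average latency described above is just the ratio of two counter deltas between scrapes. A short Python sketch with made-up sample values:

```python
# Average per-operation latency from two scrapes of the cumulative counters
# (e.g. elasticsearch_indices_search_query_time_seconds and ..._query_total).
# The sample numbers below are made up for illustration.
def avg_latency(time_prev, time_now, count_prev, count_now):
    """Mean seconds per operation between two scrapes."""
    ops = count_now - count_prev
    if ops == 0:
        return 0.0  # no operations happened; avoid division by zero
    return (time_now - time_prev) / ops

# 120 queries consumed 6 s of cumulative query time between scrapes
print(avg_latency(100.0, 106.0, 5000, 5120))  # 0.05 s average
```

In Prometheus itself this is `rate(..._time_seconds[5m]) / rate(..._total[5m])`; the function above is only the same arithmetic spelled out.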

5. Resource saturation

Metric name                                             Description
elasticsearch_thread_pool_queue_count                   Thread pool operations queued
elasticsearch_thread_pool_rejected_count                Thread pool operations rejected
elasticsearch_indices_fielddata_memory_size_bytes       Field data cache size in bytes
elasticsearch_indices_fielddata_evictions               Evictions from the field data cache
elasticsearch_indices_filter_cache_memory_size_bytes    Filter cache size in bytes
elasticsearch_indices_filter_cache_evictions            Evictions from the filter cache
elasticsearch_cluster_health_number_of_pending_tasks    Cluster-level changes not yet executed
elasticsearch_indices_get_missing_total                 Total GET requests for missing documents
elasticsearch_indices_get_missing_time_seconds          Total time of GETs for missing documents, in seconds

Build views from the metrics above. Elasticsearch nodes use thread pools to manage how threads consume memory and CPU, so queued and rejected requests tell you whether a node has enough capacity. Each node maintains many types of thread pools; the most important are generally search, index, merge, and bulk. Each pool's queue size is the number of requests currently waiting for service on that node; once a pool reaches its maximum queue size (the default differs by pool type), subsequent requests are rejected.
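Saturation usually shows up first as rejections. A PromQL sketch over the thread pool metrics above; the 5m window is an assumption to tune:

```
# Rejections per second, per node and thread pool type;
# sustained non-zero values mean the pool cannot keep up
rate(elasticsearch_thread_pool_rejected_count[5m])

# Current queue depth per pool
elasticsearch_thread_pool_queue_count
```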