Monitoring Elasticsearch with Prometheus

I. Installation

elasticsearch_exporter runs independently of the ES cluster: it requires no changes to the existing cluster(s) and no restarts, only network access to the cluster, so it can be deployed on a standalone server. Project page: elasticsearch_exporter
[root@prod-es-master01 ~]# mkdir /opt/soft/
[root@prod-es-master01 ~]# cd /opt/soft/
[root@prod-es-master01 soft]# wget https://github.com/prometheus-community/elasticsearch_exporter/releases/download/v1.3.0/elasticsearch_exporter-1.3.0.linux-amd64.tar.gz
[root@prod-es-master01 soft]# tar -zxvf elasticsearch_exporter-1.3.0.linux-amd64.tar.gz -C /usr/local/
[root@prod-es-master01 soft]# cd /usr/local/
[root@prod-es-master01 local]# mv elasticsearch_exporter-1.3.0.linux-amd64 elasticsearch_exporter

II. Starting the exporter

1. Common options

Option                   Description
--es.uri                 Address (host and port) of the Elasticsearch node to connect to; defaults to http://localhost:9200
--es.all                 Defaults to false; if true, query stats for all nodes in the cluster, not just the node connected to
--es.cluster_settings    Defaults to false; if true, include cluster settings in the stats
--es.indices             Defaults to false; if true, query stats for all indices in the cluster
--es.indices_settings    Defaults to false; if true, query settings stats for all indices in the cluster
--es.shards              Defaults to false; if true, query stats for all indices, including shard-level stats
--es.snapshots           Defaults to false; if true, query stats for cluster snapshots

2. Managing with systemd

vim /lib/systemd/system/es_exporter.service

[Unit]
Description=The es_exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/elasticsearch_exporter/elasticsearch_exporter --es.all --es.indices --es.cluster_settings --es.indices_settings --es.shards --es.snapshots  --es.uri http://user:passwd@ip:9200
Restart=on-failure

[Install]
WantedBy=multi-user.target
Note: if the ES cluster has X-Pack authentication enabled, the URI must include a username and password; otherwise they can be omitted.

3. Starting the service

systemctl daemon-reload
systemctl start es_exporter
systemctl enable es_exporter

4. Verifying the data
Now curl port 9114 to verify that metrics are being exposed:

[root@prod-es-master01 ~]# curl 127.0.0.1:9114/metrics
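The exporter serves the standard Prometheus text exposition format. As a minimal sketch of what that output looks like and how it can be parsed, here is a short Python snippet; the sample lines are illustrative, not captured from a real cluster:

```python
# Sketch: parse a few lines of Prometheus text-format output, as returned
# by `curl 127.0.0.1:9114/metrics`. The sample payload below is made up.
sample = """\
# HELP elasticsearch_cluster_health_status Whether all primary and replica shards are allocated.
# TYPE elasticsearch_cluster_health_status gauge
elasticsearch_cluster_health_status{cluster="prod",color="green"} 1
elasticsearch_cluster_health_status{cluster="prod",color="red"} 0
"""

def parse_metrics(text):
    """Return a list of (name, labels, value) from exposition-format lines."""
    out = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        metric, value = line.rsplit(" ", 1)
        if "{" in metric:
            name, labels = metric.split("{", 1)
            labels = labels.rstrip("}")
        else:
            name, labels = metric, ""
        out.append((name, labels, float(value)))
    return out

for name, labels, value in parse_metrics(sample):
    print(name, labels, value)
```

In practice Prometheus does this parsing itself; the snippet is only meant to show the shape of the data the scrape job below will collect.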

III. Configuring the scrape job

1. Configure the Prometheus scrape job

~]# vim /usr/local/prometheus/prometheus.yml

  ## elasticsearch Cluster
  - job_name: "elasticsearch Cluster"
    scrape_interval: 60s
    scrape_timeout: 60s
    static_configs:
    - targets: ['192.168.66.86:9114']
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*)\:9114'
        target_label: 'nodeip'
        replacement: '$1'
      - source_labels: [__address__]
        regex: '(.*)\:9114'
        target_label: 'hostname'
        replacement: '$1'

# Validate and reload the Prometheus configuration
~]# promtool check config /usr/local/prometheus/prometheus.yml
~]# curl -X POST http://127.0.0.1:9090/-/reload

2. Add Grafana dashboards
Dashboard IDs: 13071, 13072, 13073, 13074, 2322

IV. Creating alert rules

groups:
  - name: es-cluster-exporter.rules
    rules:
    - alert: ElasticsearchHeapUsageTooHigh
      expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: Elasticsearch heap usage too high (instance {{ $labels.instance }})
        description: "Elasticsearch heap usage is above 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchHeapUsageWarning
      expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 80
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Elasticsearch heap usage warning (instance {{ $labels.instance }})
        description: "Elasticsearch heap usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchDiskOutOfSpace
      expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 10
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Elasticsearch disk out of space (instance {{ $labels.instance }})
        description: "Disk usage is over 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchDiskSpaceLow
      expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 20
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Elasticsearch disk space low (instance {{ $labels.instance }})
        description: "Disk usage is over 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchClusterRed
      expr: elasticsearch_cluster_health_status{color="red"} == 1
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Elasticsearch cluster status red (instance {{ $labels.instance }})
        description: "Elastic cluster is in red status\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchClusterYellow
      expr: elasticsearch_cluster_health_status{color="yellow"} == 1
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: Elasticsearch cluster status yellow (instance {{ $labels.instance }})
        description: "Elastic cluster is in yellow status\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchHealthyNodes
      expr: elasticsearch_cluster_health_number_of_nodes < 3
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Elasticsearch healthy nodes (instance {{ $labels.instance }})
        description: "Fewer than 3 healthy nodes in the cluster\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchHealthyDataNodes
      expr: elasticsearch_cluster_health_number_of_data_nodes < 2
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Elasticsearch healthy data nodes (instance {{ $labels.instance }})
        description: "Fewer than 2 healthy data nodes in the cluster\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchRelocatingShards
      expr: elasticsearch_cluster_health_relocating_shards > 0
      for: 0m
      labels:
        severity: info
      annotations:
        summary: Elasticsearch relocating shards (instance {{ $labels.instance }})
        description: "Elasticsearch is relocating shards\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchRelocatingShardsTooLong
      expr: elasticsearch_cluster_health_relocating_shards > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: Elasticsearch relocating shards too long (instance {{ $labels.instance }})
        description: "Elasticsearch has been relocating shards for more than 15 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchInitializingShards
      expr: elasticsearch_cluster_health_initializing_shards > 0
      for: 0m
      labels:
        severity: info
      annotations:
        summary: Elasticsearch initializing shards (instance {{ $labels.instance }})
        description: "Elasticsearch is initializing shards\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchInitializingShardsTooLong
      expr: elasticsearch_cluster_health_initializing_shards > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: Elasticsearch initializing shards too long (instance {{ $labels.instance }})
        description: "Elasticsearch has been initializing shards for more than 15 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchUnassignedShards
      expr: elasticsearch_cluster_health_unassigned_shards > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: Elasticsearch unassigned shards (instance {{ $labels.instance }})
        description: "Elasticsearch has unassigned shards\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

    - alert: ElasticsearchPendingTasks
      expr: elasticsearch_cluster_health_number_of_pending_tasks > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: Elasticsearch pending tasks (instance {{ $labels.instance }})
        description: "Elasticsearch has pending tasks; the cluster is lagging behind\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
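The two heap rules above divide used heap bytes by max heap bytes and compare against 80% and 90%. As a quick sanity check of that arithmetic, here is a small Python sketch with made-up sample values:

```python
# Sanity check of the heap-usage alert arithmetic:
# (used / max) * 100 crossing the 80% / 90% thresholds.
def heap_severity(used_bytes, max_bytes):
    """Mirror the thresholds of the two heap alert rules."""
    pct = used_bytes / max_bytes * 100
    if pct > 90:
        return "critical"   # ElasticsearchHeapUsageTooHigh
    if pct > 80:
        return "warning"    # ElasticsearchHeapUsageWarning
    return "ok"

# Made-up example: ~29 GiB used of a 31 GiB heap (~93.5%)
print(heap_severity(29 * 2**30, 31 * 2**30))
```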

V. Core monitoring metrics

The key areas to watch:
- Query and indexing performance
- Memory allocation and garbage collection
- Host-level system and network metrics
- Cluster health and node availability
- Resource saturation and related errors

1. Cluster health and node availability
The cluster health API reports the overall health of the cluster. Treat the status as a key signal that the cluster is running smoothly, and investigate whenever it changes. The important fields it returns, and the corresponding Prometheus metrics:

Field                                  Notes                                                                 Metric name
status                                 Cluster status: green (all primary and replica shards allocated), yellow (all primaries allocated but not all replicas), red (some primaries unallocated)    elasticsearch_cluster_health_status
number_of_nodes / number_of_data_nodes Number of nodes / data nodes in the cluster                           elasticsearch_cluster_health_number_of_nodes / elasticsearch_cluster_health_number_of_data_nodes
active_primary_shards                  Total active primary shards                                           elasticsearch_cluster_health_active_primary_shards
active_shards                          Total active shards, including replicas                               elasticsearch_cluster_health_active_shards
relocating_shards                      Shards currently migrating to other nodes; usually 0, rises when nodes join or leave the cluster    elasticsearch_cluster_health_relocating_shards
initializing_shards                    Shards being initialized                                              elasticsearch_cluster_health_initializing_shards
unassigned_shards                      Unassigned shards; usually 0, rises when replica shards are lost      elasticsearch_cluster_health_unassigned_shards
number_of_pending_tasks                Only the master node can apply cluster-level metadata changes (creating indices, updating mappings, allocating shards, etc.); the pending-tasks API shows the queued tasks, and in the vast majority of cases this queue stays at zero    elasticsearch_cluster_health_number_of_pending_tasks
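The same fields can also be read straight from the `_cluster/health` API response. A self-contained sketch with a hardcoded sample payload (all values are illustrative; in practice you would fetch it with `curl http://localhost:9200/_cluster/health`):

```python
import json

# Illustrative _cluster/health response; values are made up.
payload = json.loads("""{
  "cluster_name": "prod",
  "status": "yellow",
  "number_of_nodes": 3,
  "number_of_data_nodes": 2,
  "active_primary_shards": 10,
  "active_shards": 18,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 2,
  "number_of_pending_tasks": 0
}""")

# yellow + unassigned shards usually means replica shards are missing
if payload["status"] != "green":
    print(f'cluster {payload["cluster_name"]} is {payload["status"]}, '
          f'{payload["unassigned_shards"]} unassigned shard(s)')
```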

Based on the metrics above, configure a cluster-status Singlestat panel so the health state is visible at a glance.

2. Host-level system and network metrics

Metric name                                 Description
elasticsearch_process_cpu_percent           Percent CPU used by the ES process
elasticsearch_filesystem_data_free_bytes    Free space on the block device, in bytes
elasticsearch_process_open_files_count      Open file descriptors of the ES process
elasticsearch_transport_rx_packets_total    Packets received (inter-node traffic in)
elasticsearch_transport_tx_packets_total    Packets sent (inter-node traffic out)

3. JVM memory and garbage collection

Metric name                                     Description
elasticsearch_jvm_gc_collection_seconds_count   Count of JVM GC runs
elasticsearch_jvm_gc_collection_seconds_sum     GC run time in seconds
elasticsearch_jvm_memory_committed_bytes        JVM memory currently committed, by area
elasticsearch_jvm_memory_used_bytes             JVM memory currently used, by area

Focus on JVM heap usage and the share of time spent in GC to spot garbage collection problems. Elasticsearch relies on GC to free heap memory and by default triggers a collection when heap usage reaches 75%. Alerting on heap usage tells you whether collection is keeping up with allocation; if it is not, increase the heap size or add nodes.
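One way to express "share of wall-clock time spent in GC" with the counters above is a rate over the seconds counter. A PromQL sketch; the 5m window and the average-pause formula are assumptions to adapt to your scrape interval:

```
# Fraction of time spent in JVM GC per node over the last 5 minutes
rate(elasticsearch_jvm_gc_collection_seconds_sum[5m])

# Average GC pause length: GC seconds per GC run
rate(elasticsearch_jvm_gc_collection_seconds_sum[5m])
  / rate(elasticsearch_jvm_gc_collection_seconds_count[5m])
```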

4. Search and indexing performance

Search requests:
Metric name                                       Description
elasticsearch_indices_search_query_total          Total query count
elasticsearch_indices_search_query_time_seconds   Cumulative query time in seconds
elasticsearch_indices_search_fetch_total          Total fetch count
elasticsearch_indices_search_fetch_time_seconds   Cumulative fetch time in seconds

Indexing requests:
Metric name                                               Description
elasticsearch_indices_indexing_index_total                Total index calls
elasticsearch_indices_indexing_index_time_seconds_total   Cumulative index time in seconds
elasticsearch_indices_refresh_total                       Total refreshes
elasticsearch_indices_refresh_time_seconds_total          Total time spent refreshing, in seconds
elasticsearch_indices_flush_total                         Total flushes
elasticsearch_indices_flush_time_seconds                  Cumulative flush time in seconds

Plot the time and operation-count series on one graph (time on the left y-axis, count on the right) and divide time by ops to get the average per-operation latency and judge whether performance is abnormal. A steadily rising average indexing latency may mean too many documents are being sent per bulk request. Elasticsearch persists data to disk via flush; if flush latency keeps rising, the disks may lack I/O capacity, and if it continues the node will eventually be unable to index data.
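The "ops/time" average latency described above is just the ratio of two counter deltas between scrapes. A short Python sketch with made-up sample values:

```python
# Average per-operation latency from two scrapes of the cumulative counters
# (e.g. elasticsearch_indices_search_query_time_seconds and ..._query_total).
# The sample numbers below are made up for illustration.
def avg_latency(time_prev, time_now, count_prev, count_now):
    """Mean seconds per operation between two scrapes."""
    ops = count_now - count_prev
    if ops == 0:
        return 0.0  # no operations happened; avoid division by zero
    return (time_now - time_prev) / ops

# 120 queries consumed 6 s of cumulative query time between scrapes
print(avg_latency(100.0, 106.0, 5000, 5120))  # 0.05 s average
```

In Prometheus itself this is `rate(..._time_seconds[5m]) / rate(..._total[5m])`; the function above is only the same arithmetic spelled out.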

5. Resource saturation

Metric name                                             Description
elasticsearch_thread_pool_queue_count                   Thread pool operations queued
elasticsearch_thread_pool_rejected_count                Thread pool operations rejected
elasticsearch_indices_fielddata_memory_size_bytes       Field data cache size in bytes
elasticsearch_indices_fielddata_evictions               Evictions from the field data cache
elasticsearch_indices_filter_cache_memory_size_bytes    Filter cache size in bytes
elasticsearch_indices_filter_cache_evictions            Evictions from the filter cache
elasticsearch_cluster_health_number_of_pending_tasks    Cluster-level changes not yet executed
elasticsearch_indices_get_missing_total                 Total GET requests for missing documents
elasticsearch_indices_get_missing_time_seconds          Total time of GETs for missing documents, in seconds

Build views from the metrics above. Elasticsearch nodes use thread pools to manage how threads consume memory and CPU, so queued and rejected requests tell you whether a node has enough capacity. Each node maintains many types of thread pools; the most important are generally search, index, merge, and bulk. Each pool's queue size is the number of requests currently waiting for service on that node; once a pool reaches its maximum queue size (the default differs by pool type), subsequent requests are rejected.
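Saturation usually shows up first as rejections. A PromQL sketch over the thread pool metrics above; the 5m window is an assumption to tune:

```
# Rejections per second, per node and thread pool type;
# sustained non-zero values mean the pool cannot keep up
rate(elasticsearch_thread_pool_rejected_count[5m])

# Current queue depth per pool
elasticsearch_thread_pool_queue_count
```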