Debugging Elasticsearch and tuning on small server, single node

Question

I am posting a more general question after finding that I may have more issues than low disk space:

optimise server operations with elasticsearch : addressing low disk watermarks and authentication failures

My issue is that my ES server crashes occasionally, and I cannot figure out why.

I want the service to stay up for at least several days at a time and, if an error occurs, to restart the instance automatically.

Which best practices could I follow to debug ES on a small server instance, using a single node?

This is what I am looking at: (useful resource - https://www.datadoghq.com/blog/elasticsearch-unassigned-shards/)

Check the available disk space - optimise server operations with elasticsearch : addressing low disk watermarks and authentication failures
Check the ES log (/var/log/elasticsearch):

    ...
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:651) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:536) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:490) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:450) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873) [netty-common-4.1.6.Final.jar:4.1.6.Final]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
    Caused by: org.elasticsearch.action.NoShardAvailableActionException
        ... 60 more
    [2020-05-12T15:05:56,874][INFO ][o.e.c.r.a.AllocationService] [awesome3-master] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[en-awesome-wiki][2]] ...]).
    [2020-05-12T15:10:48,998][DEBUG][o.e.a.a.c.a.TransportClusterAllocationExplainAction] [awesome3-master] explaining the allocation for [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false], found shard [[target-validation][4], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2020-05-12T15:05:54.260Z], delayed=false, allocation_status[no_attempt]]]

I spotted a shard allocation error somewhere, so I check:

    curl -s 'localhost:9200/_cat/allocation?v'
    shards disk.indices disk.used disk.avail disk.total disk.percent host      ip        node
        15      616.2mb    10.6gb     12.5gb     23.1gb           45 127.0.0.1 127.0.0.1 awesome3-master
        15                                                                               UNASSIGNED
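
To compare the two rows more easily, the `_cat/allocation` text can be parsed into a shard count per node. This is a minimal Python sketch that works on the output pasted above. It assumes node names contain no spaces, and `shards_per_node` is a hypothetical helper name, not part of any Elasticsearch client.

```python
def shards_per_node(cat_allocation_text):
    """Map each node name (or UNASSIGNED) to its shard count.

    Assumes the first non-empty line is the header, the first column is the
    shard count and the last column is the node name.
    """
    lines = [ln for ln in cat_allocation_text.splitlines() if ln.strip()]
    counts = {}
    for line in lines[1:]:  # skip the header row
        parts = line.split()
        counts[parts[-1]] = int(parts[0])
    return counts


# Output copied from the command above.
SAMPLE = """\
shards disk.indices disk.used disk.avail disk.total disk.percent host      ip        node
    15      616.2mb    10.6gb     12.5gb     23.1gb           45 127.0.0.1 127.0.0.1 awesome3-master
    15                                                                               UNASSIGNED
"""

print(shards_per_node(SAMPLE))
```

For the sample it reports 15 shard copies on awesome3-master and 15 copies under UNASSIGNED, that is, copies that are not allocated to any node.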

What does this mean? Are the indexes duplicated across more replicas (see below)?

I check

    curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep UNASSIGNED
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  1425  100  1425    0     0   5137      0 --:--:-- --:--:-- --:--:--  5144
    target-validation      4 r UNASSIGNED CLUSTER_RECOVERED
    target-validation      2 r UNASSIGNED CLUSTER_RECOVERED
    target-validation      1 r UNASSIGNED CLUSTER_RECOVERED
    target-validation      3 r UNASSIGNED CLUSTER_RECOVERED
    target-validation      0 r UNASSIGNED CLUSTER_RECOVERED
    it-tastediscovery-expo 4 r UNASSIGNED CLUSTER_RECOVERED
    it-tastediscovery-expo 2 r UNASSIGNED CLUSTER_RECOVERED
    it-tastediscovery-expo 1 r UNASSIGNED CLUSTER_RECOVERED
    it-tastediscovery-expo 3 r UNASSIGNED CLUSTER_RECOVERED
    it-tastediscovery-expo 0 r UNASSIGNED CLUSTER_RECOVERED
    en-awesome-wiki        4 r UNASSIGNED CLUSTER_RECOVERED
    en-awesome-wiki        2 r UNASSIGNED CLUSTER_RECOVERED
    en-awesome-wiki        1 r UNASSIGNED CLUSTER_RECOVERED
    en-awesome-wiki        3 r UNASSIGNED CLUSTER_RECOVERED
    en-awesome-wiki        0 r UNASSIGNED CLUSTER_RECOVERED
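
The listing can be summarised per index to see whether any primary (`p`) is affected or only replicas (`r`). This is a small Python sketch under the assumption that the columns are exactly `index,shard,prirep,state,unassigned.reason` as requested in the command above; `summarize_unassigned` is a hypothetical helper name.

```python
from collections import Counter


def summarize_unassigned(cat_shards_text):
    """Count UNASSIGNED shard copies per (index, kind), kind being primary or replica."""
    summary = Counter()
    for line in cat_shards_text.splitlines():
        parts = line.split()
        # Expected columns: index, shard, prirep, state, unassigned.reason
        if len(parts) >= 4 and parts[3] == "UNASSIGNED":
            kind = "primary" if parts[2] == "p" else "replica"
            summary[(parts[0], kind)] += 1
    return dict(summary)


# Excerpt of the output above (three of the fifteen lines).
EXCERPT = """\
target-validation      4 r UNASSIGNED CLUSTER_RECOVERED
target-validation      2 r UNASSIGNED CLUSTER_RECOVERED
en-awesome-wiki        4 r UNASSIGNED CLUSTER_RECOVERED
"""

for (index, kind), count in summarize_unassigned(EXCERPT).items():
    print(index, kind, count)
```

In the full listing every unassigned line has `r` in the third column, so only replica copies are unassigned and no primary is missing.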

This raises a question: is ES trying to create new replicas each time an error brings the system down?

So I look at an explanation:

    curl -XGET localhost:9200/_cluster/allocation/explain?pretty
    {
      "shard" : {
        "index" : "target-validation",
        "index_uuid" : "ONFPE7UQQzWjrhG0ztlSdw",
        "id" : 4,
        "primary" : false
      },
      "assigned" : false,
      "shard_state_fetch_pending" : false,
      "unassigned_info" : {
        "reason" : "CLUSTER_RECOVERED",
        "at" : "2020-05-12T15:05:54.260Z",
        "delayed" : false,
        "allocation_status" : "no_attempt"
      },
      "allocation_delay_in_millis" : 60000,
      "remaining_delay_in_millis" : 0,
      "nodes" : {
        "Ynm6YG-MQyevaDqT2n9OeA" : {
          "node_name" : "awesome3-master",
          "node_attributes" : { },
          "store" : {
            "shard_copy" : "AVAILABLE"
          },
          "final_decision" : "NO",
          "final_explanation" : "the shard cannot be assigned because allocation deciders return a NO decision",
          "weight" : 9.5,
          "decisions" : [
            {
              "decider" : "same_shard",
              "decision" : "NO",
              "explanation" : "the shard cannot be allocated on the same node id [Ynm6YG-MQyevaDqT2n9OeA] on which it already exists"
            }
          ]
        }
      }
    }
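
The part of this response that matters is the list of deciders that answered NO for each node. The Python sketch below extracts them. It assumes the response layout shown above (a `nodes` object with a `decisions` list per node); other Elasticsearch versions may lay the response out differently, and `blocking_deciders` is a hypothetical helper name.

```python
import json


def blocking_deciders(explain):
    """Map each node name to the deciders that returned NO for the shard."""
    result = {}
    for node in explain.get("nodes", {}).values():
        blockers = [
            d["decider"]
            for d in node.get("decisions", [])
            if d.get("decision") == "NO"
        ]
        result[node.get("node_name", "?")] = blockers
    return result


# Trimmed copy of the response above; only the fields used here are kept.
SAMPLE = json.loads("""
{
  "shard": {"index": "target-validation", "id": 4, "primary": false},
  "nodes": {
    "Ynm6YG-MQyevaDqT2n9OeA": {
      "node_name": "awesome3-master",
      "final_decision": "NO",
      "decisions": [
        {"decider": "same_shard", "decision": "NO"}
      ]
    }
  }
}
""")

print(blocking_deciders(SAMPLE))
```

Here the only blocker is `same_shard` on awesome3-master: the replica is refused because the node already holds a copy of that shard, and there is no second node to place it on.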

Now, I would like to better understand what a shard is and what ES is attempting to do.

Should I delete unused replicas?

And finally, what should I do to test that the service is "sufficiently" reliable?

Kindly let me know if there are best practices to follow for debugging ES and tuning the server.

My constraint is a small server, and I would be happy if the server did not crash and requests just took a little longer.
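
For the reliability check and automatic restart asked about above, one possible starting point is to poll the cluster health endpoint and let a supervisor (a cron job or a service manager) act on the result. The sketch below is in Python. `fetch_health` and `classify` are hypothetical helper names, the URL assumes the default local port used in the commands above, and the mapping from status to action is an assumption to adapt rather than an established rule.

```python
import json
import urllib.error
import urllib.request


def fetch_health(base_url="http://localhost:9200", timeout=5.0):
    """Return the parsed /_cluster/health response, or None if the node does not answer."""
    try:
        with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON.
        return None


def classify(health):
    """Turn a health response into a coarse state for a supervisor script."""
    if health is None:
        return "down"      # no answer at all: candidate for a restart
    status = health.get("status")
    if status == "red":
        return "degraded"  # at least one primary shard is unassigned
    if status in ("yellow", "green"):
        return "ok"        # all primaries are assigned
    return "unknown"
```

A supervisor could call `classify(fetch_health())` every minute and restart the service only after several consecutive `down` results. On a single node with replicas configured, `yellow` is the expected steady state, because yellow means that all primaries are assigned but some replicas are not.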

EDIT

I found this very useful question:

Shards and replicas in Elasticsearch

and this answer may offer a solution: https://stackoverflow.com/a/50641899/305883

Before testing it out as an answer, could you kindly help me figure out whether and how to back up the indexes, and how to estimate the correct parameters?

I run a single server and assume, given the above configuration, that number_of_shards should be 1 (one machine) and that the maximum number_of_replicas could be 2 (the disk should be able to handle it):

    curl -XPUT 'localhost:9200/sampleindex?pretty' -H 'Content-Type: application/json' -d '
    {
      "settings":{
        "number_of_shards":1,
        "number_of_replicas":2
      }
    }'
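
Before applying these values, it may help to estimate how many replica copies could actually be placed. The Python sketch below models only the `same_shard` rule reported by the allocation explanation earlier (at most one copy of a given shard per node) and ignores every other decider, such as disk watermarks. `unassignable_replicas` is a hypothetical helper name.

```python
def unassignable_replicas(number_of_shards, number_of_replicas, data_nodes):
    """Estimate replica copies that cannot be placed under the same_shard rule.

    Each shard has one primary plus number_of_replicas copies, and a node may
    hold at most one copy of a given shard. With the primary taking one node,
    at most data_nodes - 1 replicas of each shard can be placed.
    """
    placeable_per_shard = min(number_of_replicas, max(data_nodes - 1, 0))
    return number_of_shards * (number_of_replicas - placeable_per_shard)


# Current indexes, assuming 5 primaries and 1 replica each on 1 node,
# as the _cat/shards listing suggests.
print(unassignable_replicas(5, 1, 1))

# Proposed settings from the command above: 1 shard, 2 replicas, 1 node.
print(unassignable_replicas(1, 2, 1))

# Same index without replicas.
print(unassignable_replicas(1, 0, 1))
```

Under these assumptions each current index has 5 unplaceable replicas, 15 across the three indexes, which matches the UNASSIGNED row. The proposed settings would leave 2 replicas unassigned on a single node, and 0 replicas would leave none.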
