I am posting a more general question, after having found I may have more issues than low disk space:
My issue is that my ES server crashes occasionally, and cannot figure out why.
I want to ensure reliability at least of days, and if error occur, restart the instance automatically.
Which best practices could I follow to debug ES on a small server instance, using a single node?
This is what I am looking at: (useful resource - https://www.datadoghq.com/blog/elasticsearch-unassigned-shards/)
Check on available disk space - optimise server operations with elasticsearch : addressing low disk watermarks and authentication failures
Check on ES log (
/var/log/elasticsearch
):
...
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:651) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:536) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:490) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:450) [netty-transport-4.1.6.Final.jar:4.1.6.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873) [netty-common-4.1.6.Final.jar:4.1.6.Final]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
Caused by: org.elasticsearch.action.NoShardAvailableActionException
... 60 more
[2020-05-12T15:05:56,874][INFO ][o.e.c.r.a.AllocationService] [awesome3-master] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[en-awesome-wiki][2]] ...]).
[2020-05-12T15:10:48,998][DEBUG][o.e.a.a.c.a.TransportClusterAllocationExplainAction] [awesome3-master] explaining the allocation for [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false], found shard [[target-validation][4], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2020-05-12T15:05:54.260Z], delayed=false, allocation_status[no_attempt]]]
- I spotted somewhere a shared allocation error. So I check:
curl -s 'localhost:9200/_cat/allocation?v'
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
15 616.2mb 10.6gb 12.5gb 23.1gb 45 127.0.0.1 127.0.0.1 awesome3-master
15 UNASSIGNED
What does this mean ? Are the indexed duplicated in more replicas (see below) ?
- I check
curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep UNASSIGNED
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1425 100 1425 0 0 5137 0 --:--:-- --:--:-- --:--:-- 5144
target-validation 4 r UNASSIGNED CLUSTER_RECOVERED
target-validation 2 r UNASSIGNED CLUSTER_RECOVERED
target-validation 1 r UNASSIGNED CLUSTER_RECOVERED
target-validation 3 r UNASSIGNED CLUSTER_RECOVERED
target-validation 0 r UNASSIGNED CLUSTER_RECOVERED
it-tastediscovery-expo 4 r UNASSIGNED CLUSTER_RECOVERED
it-tastediscovery-expo 2 r UNASSIGNED CLUSTER_RECOVERED
it-tastediscovery-expo 1 r UNASSIGNED CLUSTER_RECOVERED
it-tastediscovery-expo 3 r UNASSIGNED CLUSTER_RECOVERED
it-tastediscovery-expo 0 r UNASSIGNED CLUSTER_RECOVERED
en-awesome-wiki 4 r UNASSIGNED CLUSTER_RECOVERED
en-awesome-wiki 2 r UNASSIGNED CLUSTER_RECOVERED
en-awesome-wiki 1 r UNASSIGNED CLUSTER_RECOVERED
en-awesome-wiki 3 r UNASSIGNED CLUSTER_RECOVERED
en-awesome-wiki 0 r UNASSIGNED CLUSTER_RECOVERED
and here I have a question: is ES trying to create new replicas each time an error is failing the system ?
- So I look at an explaination:
curl -XGET localhost:9200/_cluster/allocation/explain?pretty
{
"shard" : {
"index" : "target-validation",
"index_uuid" : "ONFPE7UQQzWjrhG0ztlSdw",
"id" : 4,
"primary" : false
},
"assigned" : false,
"shard_state_fetch_pending" : false,
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2020-05-12T15:05:54.260Z",
"delayed" : false,
"allocation_status" : "no_attempt"
},
"allocation_delay_in_millis" : 60000,
"remaining_delay_in_millis" : 0,
"nodes" : {
"Ynm6YG-MQyevaDqT2n9OeA" : {
"node_name" : "awesome3-master",
"node_attributes" : { },
"store" : {
"shard_copy" : "AVAILABLE"
},
"final_decision" : "NO",
"final_explanation" : "the shard cannot be assigned because allocation deciders return a NO decision",
"weight" : 9.5,
"decisions" : [
{
"decider" : "same_shard",
"decision" : "NO",
"explanation" : "the shard cannot be allocated on the same node id [Ynm6YG-MQyevaDqT2n9OeA] on which it already exists"
}
]
}
}
}
Now, I would like to better understand what a shard is and what ES is attempting to do.
Should I delete unused replicas?
And finally, what should I do to test the service is "sufficiently" reliable ?
Kindly let me know if there are best practices to follow for debugging ES and tuning server.
My constraint are a small server and would be happy if server won't crash, just take a little bit longer.
EDIT
Found this very useful question :
Shards and replicas in Elasticsearch
and this answer may offer a solution: https://stackoverflow.com/a/50641899/305883
Before testing it out as an answer, could you kindly help to figure out if / how back-up the indexes and estimating correct parameters?
I run 1 single server and assume, given the above configurations, number_of_shards
should be 1 (1 single machine) and max number_of_replicas
could be 2 (disk size should handle it) :
curl -XPUT 'localhost:9200/sampleindex?pretty' -H 'Content-Type: application/json' -d '
{
"settings":{
"number_of_shards":1,
"number_of_replicas":2
}
}'