I have a 4-node, 5-shard Elasticsearch (0.90.3) cluster. After a restart, 4 of the 5 shards are unassigned and the cluster status is red, so I assume the restart was not done correctly. Each node was sent a kill (SIGKILL) at 30-second intervals: one node was killed, 30 seconds later another of the remaining 3 was killed, and so on.

I tried this solution to get the shards reassigned, but nothing worked until I manually assigned a primary shard to the cluster using this approach. However, manually assigning a primary shard resets the data for that shard, resulting in data loss.

How do I avoid the unassigned-shard problem? And if I am stuck with it, how can I recover without data loss?

Prasanna
  • Instead of calling a kill command on the process, I would typically shut the node down. That is less likely to upset the system, as it follows a shutdown procedure - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-shutdown.html – Nathan Smith Apr 10 '14 at 09:53
  • @Nate Thanks, that is something I am planning to do. Can you also shed some light on why the unassigned-shard problem would happen? I honestly do not know what caused it. Yes, SIGKILL might have caused it, but I cannot see what happened after the SIGKILL to lead to this. – Prasanna Apr 10 '14 at 18:12
  • If you're trying to avoid downtime, shouldn't you be restarting each node before shutting down the next one? – Avish Apr 10 '14 at 20:44
  • @Avish You are right, but unfortunately that is not the case right now. The lesson learnt is to do the update one node at a time. But I am really curious to know what happened to Elasticsearch that prevented it from forming the cluster again on startup. – Prasanna Apr 10 '14 at 20:53

1 Answer

The correct way to restart a cluster is to do a rolling restart using the shutdown API.

This works by:

  1. Disabling shard allocation
  2. Restarting one node (cluster goes yellow)
  3. Waiting until it rejoins the cluster
  4. Re-enabling shard allocation
  5. Waiting until shards are reallocated (cluster goes green)
  6. Repeating on the other nodes
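
The steps above can be sketched as an ordered list of REST calls. This is only an illustration under assumptions, not a tested procedure: the helper names are hypothetical, the `cluster.routing.allocation.enable` setting is the 1.x-and-later name (0.90.x used `cluster.routing.allocation.disable_allocation`), and the `_shutdown` endpoint only exists in older releases. The sketch builds the plan without sending anything, so you can review it before wiring it to an HTTP client.

```python
def allocation_settings(enabled):
    """Body for PUT /_cluster/settings that toggles shard allocation.

    Assumption: uses the 1.x+ setting name. On 0.90.x the equivalent
    setting was cluster.routing.allocation.disable_allocation.
    """
    value = "all" if enabled else "none"
    return {"transient": {"cluster.routing.allocation.enable": value}}


def rolling_restart_plan(node_names):
    """Return the ordered (method, path, body) calls for a rolling restart.

    Nothing is sent over the network; the caller decides how to execute
    each step. Starting the node again after the shutdown call is done
    outside Elasticsearch (service manager, init script, etc.).
    """
    total = len(node_names)
    plan = []
    for node in node_names:
        # 1. Disable shard allocation.
        plan.append(("PUT", "/_cluster/settings", allocation_settings(False)))
        # 2. Shut the node down cleanly (older releases only).
        plan.append(("POST", "/_cluster/nodes/%s/_shutdown" % node, None))
        # 3. After restarting it, wait until all nodes are back.
        plan.append(("GET", "/_cluster/health?wait_for_nodes=%d" % total, None))
        # 4. Re-enable shard allocation.
        plan.append(("PUT", "/_cluster/settings", allocation_settings(True)))
        # 5. Wait for green before touching the next node.
        plan.append(("GET", "/_cluster/health?wait_for_status=green", None))
    return plan


if __name__ == "__main__":
    for method, path, body in rolling_restart_plan(["node1", "node2"]):
        print(method, path, body)
```

The important property is that step 5 completes for one node before step 1 begins for the next, so at most one node's shards are unassigned at any time.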

You may want to increase indices.recovery.max_bytes_per_sec and cluster.routing.allocation.node_concurrent_recoveries to speed up step 5. Whilst the cluster is yellow, some shards will be unassigned (because they were on the node that was restarted), but this is not a problem. Reads and writes will still work as normal.
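
As a rough sketch, the two recovery settings mentioned above could be raised through a transient cluster settings update. The function name and the values `100mb` and `4` are placeholders chosen for illustration, not recommendations, and this assumes both settings can be updated dynamically in your version.

```python
import json


def recovery_speed_settings(max_bytes_per_sec="100mb", concurrent_recoveries=4):
    """Body for PUT /_cluster/settings that raises recovery throughput.

    The default argument values are illustrative assumptions; pick values
    that suit your disks and network.
    """
    return {
        "transient": {
            "indices.recovery.max_bytes_per_sec": max_bytes_per_sec,
            "cluster.routing.allocation.node_concurrent_recoveries": concurrent_recoveries,
        }
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent to /_cluster/settings.
    print(json.dumps(recovery_speed_settings(), indent=2))
```

Because the settings are transient, they would not survive a full cluster restart, which suits a temporary boost during maintenance.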

Wilfred Hughes
  • It seems this is no longer relevant, as the 5.5 documentation https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-nodes-shutdown.html#_rolling_restart_of_nodes_full_cluster_restart says `The _shutdown API has been removed` – vkats Aug 22 '17 at 13:08