
I have an Elasticsearch cluster of five nodes. Each node has the same configuration: 5 shards per index and 4 replicas. The idea is that every node will hold a copy of every shard.

Four of my nodes have five shards each. One of those nodes holds ALL of the primaries. One node has NOTHING. And that, of course, leaves me with 5 unallocated shards.

I reload a new index every day and this is exactly how it allocates shards every time.

The goal here is to figure out why the one node gets nothing. That's bad.

It would be easy for me to ask why this is happening, and if anyone knows, that would be fantastic. But since I can't find ANYTHING online or in the documentation to explain it, the better question may be: how can I diagnose it? Any clues? Anything I can look at that would give a hint here?
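For reference, this is the sort of thing I've been looking at; a minimal sketch, assuming the cluster answers on localhost:9200:

    # List every shard copy, its state (STARTED/UNASSIGNED) and which node holds it
    curl -s 'http://localhost:9200/_cat/shards?v'

    # Per-node shard counts and disk usage, to see which node is getting nothing
    curl -s 'http://localhost:9200/_cat/allocation?v'

    # Overall cluster health, including the unassigned_shards count
    curl -s 'http://localhost:9200/_cluster/health?pretty'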

EDIT TO ADD: here is my configuration. Every machine looks like this (with the exception of the node name and the discovery host list, of course):

#
# Server-specific settings for cluster domainiq-es
#
cluster.name: domainiq-es
node.name: "Mesa-01"
discovery.zen.ping.unicast.hosts: ["m1plfinddev03.prod.mesa1.gdg", "m1plfinddev04.prod.mesa1.gdg", "p3plfinddev03.prod.phx3.gdg", "p3plfinddev04.prod.phx3.gdg"]
#
# The following configuration items should be the same for all ES servers
#
node.master: true
node.data: true
index.number_of_shards: 5
index.number_of_replicas: 4
index.store.type: mmapfs
index.memory.index_buffer_size: 30%
index.translog.flush_threshold_ops: 25000
index.refresh_interval: 30s
bootstrap.mlockall: true
gateway.recover_after_nodes: 4
gateway.recover_after_time: 2m
gateway.expected_nodes: 5
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.timeout: 10s
discovery.zen.ping.retries: 3
discovery.zen.ping.interval: 15s
discovery.zen.ping.multicast.enabled: false
index.search.slowlog.threshold.query.warn: 500ms
index.search.slowlog.threshold.query.info: 200ms
index.search.slowlog.threshold.query.debug: 199ms
index.search.slowlog.threshold.query.trace: 198ms
index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.trace: 200ms
index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.debug: 2s
index.indexing.slowlog.threshold.index.trace: 500ms
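Since number_of_shards and number_of_replicas are set here as node-level defaults, one sanity check is that the daily index really was created with 5 primaries and 4 replicas. A minimal sketch, with dailyindex standing in for the real index name:

    # Confirm the index picked up 5 primaries and 4 replicas
    curl -s 'http://localhost:9200/dailyindex/_settings?pretty'

    # If the replica count were wrong, it can be changed on a live index
    curl -s -XPUT 'http://localhost:9200/dailyindex/_settings' -d '{"index": {"number_of_replicas": 4}}'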
  • What explicit settings have you put on each node's elasticsearch.yml file? Are these nodes on separate machines? Do you have anything suspicious in logs? If nodes are on separate machines are you sure each is aware of the other nodes? – Andrei Stefan Nov 13 '14 at 18:59
  • Same settings on each, I will edit my post, above, to include it. All nodes are on individual machines. Nothing suspicious in logs, that's where I checked first. All nodes are aware of others. That is, I can do a query on the node with no data and it returns valid results by getting the data from another node, as expected. – Christopher Ambler Nov 13 '14 at 19:05
  • 1
    I added [this](http://stackoverflow.com/a/23816954/2785358) answer on how I went about diagnosing something like this... basically try to manually route one of the shards and see why it won't move – Alcanzar Nov 13 '14 at 19:46
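A sketch of the manual-routing check Alcanzar describes; dailyindex, shard 0, and the node name are placeholders for the real unassigned shard and the empty node, and dry_run means nothing is actually moved:

    # Try to allocate one of the unassigned shards onto the empty node and ask
    # Elasticsearch to explain its decision; dry_run leaves the cluster untouched
    curl -s -XPOST 'http://localhost:9200/_cluster/reroute?explain&dry_run&pretty' -d '{
      "commands": [
        {
          "allocate": {
            "index": "dailyindex",
            "shard": 0,
            "node": "Mesa-05"
          }
        }
      ]
    }'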

1 Answer


Thanks to Alcanzar's comment above, I believe the issue here is the one he saw: different versions. The node that will not accept shards is running one version earlier than the others.
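For anyone hitting the same thing, the per-node version is a one-liner to check (assuming the cluster answers on localhost:9200):

    # Show each node's name and Elasticsearch version side by side;
    # a mixed-version cluster will show the straggler immediately
    curl -s 'http://localhost:9200/_cat/nodes?v&h=name,version'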

I will upgrade everything to 1.4 this weekend and likely see this go away. Makes total sense now.