44

I ask because our search is in a state of flux as we work things out. Each time we change the index (a different tokenizer or filter, or a new number of shards/replicas), we have to blow away the entire index and re-index all our Rails models back into Elasticsearch ... which means we have to factor in downtime to re-index all our records.

Is there a smarter way to do this that I'm not aware of?

user156327
concept47

4 Answers

71

I think @karmi got it right, but let me explain it a bit more simply. I occasionally need to upgrade a production schema with new properties or analysis settings. I recently started using the scenario described below to do live, constant-load, zero-downtime index migrations. You can do it remotely.

Here are steps:

Assumptions:

  • You have index real1 and aliases real_write, real_read pointing to it,
  • the client writes only to real_write and reads only from real_read,
  • _source property of document is available.

1. New index

Create a real2 index with the new mapping and settings of your choice.
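
For example, a minimal sketch in the same curl style as the rest of this answer (the article type and its fields are hypothetical placeholders for your own mapping):

curl -XPUT 'http://esserver:9200/real2' -d '
{
    "settings" : {
        "number_of_shards" : 5,
        "number_of_replicas" : 1
    },
    "mappings" : {
        "article" : {
            "properties" : {
                "title" : { "type" : "string", "analyzer" : "snowball" }
            }
        }
    }
}'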

2. Writer alias switch

Switch the write alias using the following request:

curl -XPOST 'http://esserver:9200/_aliases' -d '
{
    "actions" : [
        { "remove" : { "index" : "real1", "alias" : "real_write" } },
        { "add" : { "index" : "real2", "alias" : "real_write" } }
    ]
}'

This is an atomic operation. From this moment, all new client data is written to real2 on all nodes, while readers still use the old real1 via real_read. This is eventual consistency.

3. Old data migration

Data must be migrated from real1 to real2, but new documents in real2 must not be overwritten with old entries. The migration script should use the bulk API with the create operation (not index or update). I use a simple Ruby script, es-reindex, which prints a nice E.T.A. status:

$ ruby es-reindex.rb http://esserver:9200/real1 http://esserver:9200/real2
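
Under the hood, such a script reads batches of documents from real1 and writes them to real2 with the bulk create action, so anything already written to real2 by clients is never overwritten. A minimal sketch of a single bulk request (the type and document are hypothetical); note --data-binary, which preserves the newlines the bulk API requires:

curl -XPOST 'http://esserver:9200/_bulk' --data-binary '{ "create" : { "_index" : "real2", "_type" : "article", "_id" : "1" } }
{ "title" : "Hello world" }
'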

UPDATE 2017: You may consider the newer Reindex API instead of the script. It has a lot of interesting features, such as conflict reporting.
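
With a recent Elasticsearch, a rough equivalent might look like the sketch below: "op_type" : "create" keeps old documents from overwriting newer ones, and "conflicts" : "proceed" records version conflicts instead of aborting on the first one.

curl -XPOST 'http://esserver:9200/_reindex' -H 'Content-Type: application/json' -d '
{
    "conflicts" : "proceed",
    "source" : { "index" : "real1" },
    "dest" : { "index" : "real2", "op_type" : "create" }
}'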

4. Reader alias switch

Now real2 is up to date and clients are writing to it; however, they are still reading from real1. Let's update the reader alias:

curl -XPOST 'http://esserver:9200/_aliases' -d '
{
    "actions" : [
        { "remove" : { "index" : "real1", "alias" : "real_read" } },
        { "add" : { "index" : "real2", "alias" : "real_read" } }
    ]
}'

5. Backup and delete old index

Writes and reads now go to real2. You can back up the real1 index and then delete it from the ES cluster.
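
Once you have a backup (e.g. via a snapshot), removing the old index is a single call:

curl -XDELETE 'http://esserver:9200/real1'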

Done!

gertas
  • Thanks. The es-reindex script just copies from an existing index; if the data needs updating from the database, you'd use a river import in this spot, right? – Kevin Jul 25 '13 at 00:28
  • Actually I haven't been using rivers yet. I would just adapt the es-reindex script to my needs, e.g. to update the payload. Keep in mind that if you introduce a conflicting change, a *live* migration may not be possible. – gertas Jul 25 '13 at 07:34
  • Appears that the script copies mappings as well. Does it override mappings created in step 1? – cmonkey Sep 25 '13 at 01:59
  • @cmonkey No, the mapping is created only when it doesn't exist (es-reindex.rb:90) or when it is deleted first with the `-r` option. – gertas Sep 26 '13 at 18:32
  • 2
    We've expanded that helpful script into a full gem: https://github.com/mojolingo/es-reindex – Justin Aiken Jan 09 '15 at 20:11
  • 1
    This does not address the requirement some applications have that writes followed by searches show the updated write. You have to write to both indexes. You can have two write aliases (ES does not allow one write alias fanning out to two indexes). However, you then get a race between your application (or background DB->ES data pusher) and your reindex process when you decide to delete the second alias: any writes will get an error. Your app also doesn't know when it should start the double writes. – Paul S Jul 08 '17 at 00:17
30

Yes, there are smarter ways to re-index your data without downtime.

First, never, ever use the "final" index name as your real index name. So, if you'd like to name your index "articles", don't use that name as a physical index, but create an index such as "articles-2012-12-12" or "articles-A", "articles-1", etc.

Second, create an alias "articles" pointing to that index. Your application will then use this alias, so you'll never need to change the index name manually, restart the application, etc.

Third, when you want or need to re-index the data, re-index it into a different index, let's say "articles-B" -- all the tools in Tire's indexing toolchain support you here.

When you're done, point the alias to the new index. This way, you not only minimize downtime (there isn't any), you also have a safe snapshot: if you somehow mess up the indexing into the new index, you can simply switch back to the old one until you resolve the issue.
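
As a sketch in curl (index names are examples), the initial setup might look like this:

curl -XPUT 'http://esserver:9200/articles-A'

curl -XPOST 'http://esserver:9200/_aliases' -d '
{
    "actions" : [
        { "add" : { "index" : "articles-A", "alias" : "articles" } }
    ]
}'

Then, after re-indexing into articles-B, repoint the alias in one atomic request:

curl -XPOST 'http://esserver:9200/_aliases' -d '
{
    "actions" : [
        { "remove" : { "index" : "articles-A", "alias" : "articles" } },
        { "add" : { "index" : "articles-B", "alias" : "articles" } }
    ]
}'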

karmi
  • My issue is that I index all my models in one index. I wonder, is there a way to do the re-index to a different index in this scenario? Will "rake environment tire:import CLASS='Article' INDEX='articles-2011-05'" actually index to 'articles-2011-05' when I have index_name specified as 'articles' in my Rails model? – concept47 Dec 13 '12 at 21:16
  • 3
    @karmi, I have a doubt here. You have said to point the alias to the new index after migrating the data. But during the migration, if there is no downtime, newer data will be inserted into the old index, and the new index will not have it. How can we avoid this loss of data? – rubyprince Jul 12 '13 at 10:38
  • @rubyprince That's a warranted doubt. You either have to disable/buffer updates during migration, or replay the updates on the new index. – karmi Jul 14 '13 at 21:29
  • 2
    Good article describing this here: http://www.elasticsearch.org/blog/changing-mapping-with-zero-downtime/ – Adrian Carr Dec 26 '13 at 19:22
  • 1
    @karmi... I am concerned about the data loss pointed out by rubyprince; do we have a better way other than re-indexing the newer data inserted into the old index? – user2756589 Feb 20 '15 at 08:01
  • We do this at large scale, this has proved to work very well for us. – Kenny Cason Jan 20 '17 at 17:39
3

I recently wrote up a blog post about how I handled reindexing with no downtime. It takes some time to figure out all the little things that need to be in place to do so. Hope this helps!

https://summera.github.io/infrastructure/2016/07/04/reindexing-elasticsearch.html

To summarize:

Step 1: Prepare New Index

Create your new index with your new mapping. This can be on the same instance of Elasticsearch or on a brand new instance.

Step 2: Keep Indexes Up To Date

While you're reindexing, you want to keep both your new and old indexes up to date. For writes, this can be done by sending the write operation via a background worker to both the new and old indexes, as sketched below.
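
In raw API terms, the dual write is simply the same request issued against both indexes (index names, type, and document here are hypothetical; in a Rails app this would live in the background job rather than be issued by hand):

curl -XPUT 'http://esserver:9200/articles_old/article/1' -d '{ "title" : "Hello" }'
curl -XPUT 'http://esserver:9200/articles_new/article/1' -d '{ "title" : "Hello" }'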

Deletes are a bit trickier because there is a race condition between deleting a record and reindexing it into the new index. So you'll want to keep track of the records that need to be deleted during your reindex and process these when you are finished. If you aren't performing many deletes, another way is to eliminate the possibility of a delete during your reindex.

Step 3: Perform Reindexing

You’ll want to use a scrolled search for reading the data and the bulk API for inserting. Since after Step 2 you'll be writing new and updated documents to the new index in the background, you want to make sure you do NOT update existing documents in the new index with your bulk API requests.

This means that the operation you want for your bulk API requests is create, not index. From the documentation: “create will fail if a document with the same index and type exists already, whereas index will add or replace a document as necessary”. The main point here is you do not want old data from the scrolled search snapshot to overwrite new data in the new index.

There's a great script on GitHub to help you with this process: es-reindex.
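
If you want to see the moving parts, here is a bare-bones sketch of the scroll side in curl (index name, batch size, and the scroll id placeholder are illustrative); each batch returned would then be turned into a bulk request using create actions, as described above:

curl -XGET 'http://esserver:9200/articles_old/_search?scroll=1m' -d '
{
    "size" : 1000,
    "query" : { "match_all" : {} }
}'

curl -XGET 'http://esserver:9200/_search/scroll' -d '
{
    "scroll" : "1m",
    "scroll_id" : "SCROLL_ID_FROM_PREVIOUS_RESPONSE"
}'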

Step 4: Switch Over

Once you’re finished reindexing, it’s time to switch your search over to the new index. You’ll want to turn deletes back on or process the enqueued delete jobs for the new index. You may notice that searching the new index is a bit slow at first. This is because Elasticsearch and the JVM need time to warm up.

Perform any code changes you need so your application starts searching the new index. You can continue writing to the old index in case you run into problems and need to roll back. If you feel this is unnecessary, you can stop writing to it.

Step 5: Clean Up

At this point you should be completely transitioned to the new index. If everything is going well, perform any necessary cleanup such as:

  • Delete the old index host if it’s different from the new
  • Remove serialization code related to your old index
Ari
  • 2
    You really should be including relevant information from your blog post into the answer. If your blog post needs to go away for whatever reason in the future, this answer becomes useless. – Chris Peters Jul 07 '16 at 15:36
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/low-quality-posts/12935958) – Vasseurth Jul 07 '16 at 19:17
  • 1
    @Vasseurth thanks for the recommendation. My answer is updated with a summary from the article. – Ari Jul 07 '16 at 19:58
2

Maybe create another index, re-index all the data onto that one, and then make the switch when it's done re-indexing?

Emil Hajric
  • hmmm ... in our case, we have a Rails app with all the indexes hard-coded in; it would be hard to change it over and then change it back. I wondered about firing up a new node to do the re-index, but Elasticsearch redistributes shards onto new nodes that you create :\ – concept47 Dec 13 '12 at 00:37