
Most of the AWS infrastructure of the company I work for is described and managed using Terraform.

We have several different services including containerized back-ends and CDN'ed front-ends.

From Route53 domains and namespaces to ELBs, ECS and CloudFront, there is a lot going on.

One issue we are facing right now is that checking, refreshing and validating the Terraform state takes a long time, mostly because of the Route53 DNS records.

And this is the problem we're trying to solve:

How can we drastically reduce the time it takes for the Terraform state to be refreshed/checked?

Moving the Route53 resources into a separate repository apparently isn't a good option, because that would make all the Route53-related variables inaccessible or, possibly, outdated.

Jonathan Soifer
  • Is all of your Terraform configuration in a single place? Best practices state that you should split things up to only group stuff that needs to be applied at the same time to minimise blast radius, make it easier to make concurrent changes while not breaking state, and reduce the time it takes Terraform to refresh and build the dependency graph. – ydaetskcoR Jul 14 '18 at 09:46
  • How many resources (quantified as 1 resource per line in the output of a plan) and how much time does a plan take? Example: I have 250+ resources 20ish of which are route53 stuff - it takes < 20 seconds to do a plan. Are the times you're seeing on par with that? – Shorn Jul 14 '18 at 23:38
  • @ydaetskcoR We have a single repo describing the infrastructure for the whole company. There are different .tf files to keep resources organized according to what makes sense to us. But they're still read "all at once". – Jonathan Soifer Jul 15 '18 at 17:08
  • @Shorn I would have to compare my data with yours, thanks for providing it. Although the number of Route53 resources we have is at least one order of magnitude greater than that. – Jonathan Soifer Jul 15 '18 at 17:10
  • A single repo is fine, but generally you'd only put .tf files in the same directory if they _needed_ to be applied at the same time. You should then split your directory structure up in the ways mentioned in other Terraform project structure questions on Stack Overflow. – ydaetskcoR Jul 16 '18 at 08:48
  • Where are you storing the state? If you are storing it on S3, split the infrastructure into components and, to access the state as little as possible, store your state outputs in SSM Parameter Store. If you keep the outputs in SSM, you rarely access the state and things run much faster. If you have multiple accounts, you might also want to put your state in different buckets and enable cross-account access so states can communicate via remote state. – victor m Dec 25 '18 at 20:52
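The SSM Parameter Store pattern from the last comment could be sketched roughly like this. This is a hedged illustration, not the asker's setup: the parameter path, zone name and record values are all hypothetical.

```hcl
# In the component that owns the Route53 zone: publish the values other
# components need as SSM parameters, so consumers don't have to read the
# full remote state. The parameter path is a hypothetical example.
resource "aws_ssm_parameter" "zone_id" {
  name  = "/infra/dns/zone_id"
  type  = "String"
  value = aws_route53_zone.main.zone_id
}

# In a consuming component: reading the parameter is a single cheap API
# call instead of a full state refresh of the DNS component.
data "aws_ssm_parameter" "zone_id" {
  name = "/infra/dns/zone_id"
}

resource "aws_route53_record" "app" {
  zone_id = data.aws_ssm_parameter.zone_id.value
  name    = "app.example.com"              # hypothetical record
  type    = "CNAME"
  ttl     = 300
  records = ["my-alb.example.com"]         # hypothetical target
}
```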

2 Answers


You should break the state out into component sub-states that have sensible logical distinctions, such as "front-end", "caching" or whatever makes sense for how your company organizes and classifies infrastructure.

In terms of making variables accessible, you can declare the other states as terraform_remote_state data sources and pull values from them (assuming they expose valid outputs for the values you are interested in).
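As a sketch of that data-source approach, assuming the split-out DNS state lives in an S3 backend and exposes a zone_id output (the bucket, key and output name below are all illustrative, not from the question):

```hcl
# Read the outputs of the separately-applied DNS state. Assumes that
# state was written with an S3 backend; bucket/key are hypothetical.
data "terraform_remote_state" "dns" {
  backend = "s3"
  config = {
    bucket = "my-terraform-states"
    key    = "dns/terraform.tfstate"
    region = "us-east-1"
  }
}

# The DNS component must declare the output on its side, e.g.:
# output "zone_id" { value = aws_route53_zone.main.zone_id }

resource "aws_route53_record" "api" {
  zone_id = data.terraform_remote_state.dns.outputs.zone_id
  name    = "api.example.com"    # hypothetical record
  type    = "A"
  ttl     = 300
  records = ["203.0.113.10"]     # example IP from the documentation range
}
```

Each component then refreshes only its own resources on plan, instead of walking the whole company's graph.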

Bilaal Rashid

I came here because I was researching a similar issue. It seems that TF is terrible at graph-walking, so the more interconnected your stuff is, the worse it performs. I have a ball of yarn with 2,300 resources that takes 49 minutes to plan on a machine with enough memory and processors to run at parallelism 10 without peaking. A third of that is spent refreshing the state, and that can probably not be reduced since it's bound by the AWS API calls. But the third spent before the state refresh and the third after seem to be mostly TF faffing about in the graph (based on the logs).

I found some discussion that would seem to indicate that the structure of your code might influence the planning time dramatically, specifically the use of for_each (link #1 & #2). Since my code base makes heavy use of this, I found that interesting. YMMV ;)

mhvelplund
  • Oh, and obviously, if you can reduce the size of the stack by splitting it, you should see a superlinear reduction in planning time, but I'm guessing that people coming here have already tried that ;) – mhvelplund Oct 28 '20 at 08:12