Terraform is terrible
Remote state storage on AWS uses 2 resources?! Client design for CI jobs is an afterthought. Breaking change version every quarter. That's Terraform.
Here is my experience from running and upgrading a small Terraform project. As you might have guessed it was not great, but I'll try to focus on facts rather than opinions (even though some might sneak in). It will be mainly about the CLI client and it's versioning schema, but also some complaints about state management. I'm big proponent of CI/CD and Infrastructure as Code and I will try to explain how Terraform does not fit the picture.
The project is small, but manages 8 clusters. Contrary to typical case it's a SaaS: Atlas service that offers managed MongoDB on AWS in our case. Every project is using some of 7 modules that represent Atlas resources with necessary AWS bindings (i.e. secrets). When we started 0.12 was the newest version, so upgrade to 0.13 is part of the story.
Since the issues are about Terraform client mostly, the IaaS or SaaS used is not that relevant. However Atlas plugin had to change how internal configuration structure, which only added insult to injury.
State
Let's start with state, which is common pain point while working with Terraform. To some extend I understand the decision of using state, but it is inherently difficult to manage.
I dare to say that remote state has all the disadvantages of cache, but not many advantages. Sure, for a team working on a project remote state is a must. Actually I'd like a solution that would enforce using remote state, but there's none - we have to rely on state config will be copied over from existing project.
State config on AWS using S3 and DynamoDB
State configuration is another weak point. Typically S3 backend is used to store state, that's fine. But if you want to make it safe from many people overwriting each other changes by running terraform apply
at the same time you need additional configuration for locks or mutex. You should definitely use that!
In AWS realm DynamoDB is needed for locks. My guess is S3 does not have atomic operation to obtain a lock, so key-value database is needed. What I don't understand is: why I need both? Why not store state in DynamoDB directly? The state contents is just JSON, right? If you happen to work on Terraform let me know, please. For me that's a big overlook.
Anyway, below is sample state config using S3 and DynamoDB - both are needed to make it safe. Feel free to copy :)
terraform {
backend "s3" {
bucket = "terraform-state-prod"
key = "mongo-atlas/resources/DEVELOPMENT-resources.state"
region = "eu-west-1"
dynamodb_table = "terraform-state-prod"
}
}
Remember to use remote state (with locks!) if you're working on a team or using Terraform on CI/CD pipeline. This is lesser of two evils (everyone having its own state!).
Discrepancies of state contents
Using locks avoid one way state can be corrupted: running more than one terraform apply
at the same time. Other issues arise from HCL code, state and system discrepancies. Let's look at those.
First issue is the difference between Terraform state and the actual state of the system. Sometimes it's called configuration drift. One case is when someone adds something in the admin console, then it's invisible for Terraform. We had that for IP access list, but since all the entries are independent there was no conflict. The only drawback was that we no longer have single source of truth, so better avoid such practice and either manage things manually or via Terraform code.
Bigger problem is when someone changes something on the server manually, but the change is not reflected in Terraform. In such case it will be visible during terraform plan
as change. You have to make a decision if to override it via terraform apply
or reconcile to HCL code. In case of the latter there will be playing a detective investigation on who and why made a config change. My recommendation: avoid those cases at all cost and manage everything via Terraform!
Default values are stored in state
Another interesting case is when state has items that are not in the Terraform code. I guess that happens, because Terraform stores in state the evaluated state from the system (for us: Atlas service), with the created resource IDs and values for default parameters. We were quite surprised that plan says bi_connector
will be removed. What puzzled me even more: bi_connector
was not in the code!
What happened was: bi_connector
is optional, so default values were stored in state. Since Atlas plugin was upgraded they've removed bi_connector
attribute and replaced it with bi_connector_config
to adhere to new HCL parser. That's an example how syntax breaking changes affect you in a surprising way, but more on that later.
Other cases of state corruption
Terraform stores a client version in the state too. It has got interesting implications: if you change the state with newer client (running terraform apply
) it will enforce others (i.e. CI worker) to use the same version too. We've mitigated that by wrapping Terraform CLI in script that manages the version, so everyone will use exactly the same.
On top of all that state can get into messed up state when terraform apply
process terminates (i.e. by hitting Ctrl+C
). To be honest I haven't experienced that myself, just heard that on Terraform training. However I can guess why: there's no transactional support in applying changes by Terraform. I understand why, distributed transactions are hard, but still I don't like it.
Client
My biggest surprise was how bad the Terraform client is. Don't get me wrong: I love that it's a standalone Go binary, so it was easy to manage specific version. The Makefile script downloads Terraform binary, then runs terraform init
and terraform plan
for specific cluster. On the other hand I don't want to create such wrappers, just use whatever version is compatible. That is not possible due to versioning policy and instability of the tool, but more on that later.
Managing many workspaces
As I mentioned we have 8 clusters, each having resources configuration in its own directory. I needed to use the downloaded binary and run it in the cluster dir, easy. After looking at help there's a terraform apply [options] DIR
param, so tried to use that. Unfortunately I've got an error about missing param values, like it would not read the .tfvars
files from the same directory.
Apparently the DIR
at the end does not work as you'd expect, instead Hashicorp added -chdir
global param to handle such use case. Param -chdir
landed in 0.14 version, but I was upgrading to 0.13, so clever workaround of cd cluster/DEVELOPMENT && ../../terraform apply
was necessary.
Since Makefile is excellent at managing dependency graph I thought: lets run terraform init
only when necessary. That was quite easy, unless init fails for some reason and leaves local workspace partially initialized. Then only way to manage that in Makefile was to remove the partially initialized workspace and start over. That is how the overly familiar Makefile cleanup
target was born, which nukes .terraform
dir for every workspace and removes the binary too. I wish I better understood how to local Terraform workspace is created, because I sense there's a room for improvement :)
Running apply in CI/CD pipeline
Terrafrom client has been made for interactive use from the start. That's why it will force you to type yes
in to confirm terraform apply
changes and provide a prompt for missing parameter values (instead of just exiting with an error). To overcome that you have to resort to CLI params like -auto-approve
or -var 'foo=bar'
for example. Those are necessary for any CI/CD pipeline, but such UX design looks like an afterthought. The UX with -auto-approve
on CI is crumbled in another way: terraform apply -auto-approve
does not present changes that are going to be made. Why!? How to verify what kind of changes were made by looking at pipeline console then?
The solution is to run terraform plan
just before apply, so we'll have a record of changes to the infra. There's another problem with that: there might be changes in the infra system state between plan
and apply
actions. Let's imagine a pull request workflow, where plan
is run for a branch to verify that code diff has the desired effect on the infra and apply
is run after merge to main branch. There could be hours, even days, between change plan and the actual application of them.
One solution to that problem would be to run plan
command just before apply
again on the main branch (after merge). Ideally there should be an option to prevent apply
in case of change plan taking an undesired direction.
Second solution could be using plan -out
parameter, which saves change plan to file, so apply
can pick up the exact change plan that was generated before. To be honest I haven't tried that yet, but I keep wondering why -out
is not the default setting and why apply does not require change plan as input. Such design decisions keep me puzzled.
Versioning
Our project has started around August 2020, so Terraform 0.12 was the most recent version. I was happy about that, because 0.12 introduced syntax changes to how params and should be quoted. In reality many of the existing solutions used the old syntax, which still worked in 0.12 version. As a result code became a mix of new and old syntax, which was not a problem until the upgrade to 0.13 started throwing deprecation warnings.
Of course I've learned about terraform 0.12upgrade
command soon, but it kept throwing syntax errors on projects that had mix of old and new syntax. Our option was either to downgrade everything to 0.11 (which sounds silly) or upgrade syntax manually. I went with the latter, which was quite a lot of getting back and forth, because deprecation warnings do not show you all the occurrences of the problem, but just the first one and "there are 55 more" message. Not useful.
Of course terraform 0.12upgrade
is only available in 0.12 version, so even if I wanted 0.13 first had to download older one to try to upgrade code. That was the reason that led me to creating Makefile wrapper for Terraform binary.
Why bother with 0.13 upgrade after all? Well, we needed for_each
syntax feature which was only in 0.13 version for modules. A side note: is it only me or adding new syntax for some cases in 0.12.6
version feels wired? The for_each
feature was required to manage many Atlas user roles in a dynamic attribute.
The biggest problem is that even patch versions (x.y.Z) can introduce new features, so it might not be enough to have any 0.13.x version, but rather bind to the specific 0.13.7
for all clients. In addition client version is stored in state, so using newer by accident can enforce an upgrade for everyone.
Managing Terraform versions is really cumbersome, to the extend that tools like terraform-switcher exist. I did not wanted yet another interactive CLI tool for our CI server, that's why I've went with Makefile that can be used on CI and locally. In my opinion Makefile is vastly misunderstood (and hence underused tool), but that's a story for another time.
Since the start of our project (August 2020) Hashicorp released 3 new major versions. The newest one is 0.15.4
and we've started on 0.12.5
back then. It means that over the last 6 months 3 breaking change releases have been made, one every second month! That is rapid change rate for something that should be stable and boring like infrastructure. The other surprising fact is that initial Terraform release was in 2014, so the project is almost 7 years old.
Parting thoughts
So far it was mostly about facts and my experiences around using and upgrading Terraform. Now time for a little opinion and thoughts about the project. My experience with managing Terraform is just last few months, previously I've mostly used Terraform setup by someone else.
Based on the rate of breaking changes in the last 6 months I'm worried about the stability of the product. In my opinion it should have a big BETA badge to warn about that, even despite being 7 years in development. Sure the 0.x.y
versioning scheme might indicate that it's not ready for prime time and API breaking changes will happen. I understand that, even SemVer allows breaking changes for minor version bump (0.x.0) if it's not 1.0.0
yet. For me that looks like a lazy policy on the Hashicorp side that after years of development they still have a policy open to breaking changes. I thought even Facebook dropped "move fast and break things" attitude by now...
Even though development seems to be dynamic I see that legacy seems to creep in already. Just look at the DIR
command line parameter that is going to be replaced by -chdir
which is clearly stated in the commit message. Even for the Terraform documentation it is going to be quite a lot of work. What about all the solutions and workarounds existing the the wild, i.e. on StackOverlow or blog posts? This is going to have similar impact as using old syntax (0.11 and older) in new project, because someone found a solution somewhere. Without clearly communicated versioning policy it will never get in order.
The biggest surprise to me is that Terraform has ADOPT status on TechRadar April 2019 from ThoughtWorks. Maybe the timing plays a role here, since it was way before the major change in syntax in 0.12 version shipped mid-2020? I wonder if Terraform is still state of the art, or there are better Infrastructure as Code solutions recommended by ThoughtWorks or others?