Terraform uses state files to map the resources it creates back to the resource definitions in your *.tf files. Each deployment has its own state, stored according to the backend configured for that deployment. By default, terraform uses a local backend, which stores state on the local filesystem; that’s what we’ve been using in part 1 and part 2 of the terraform skeleton series.[1] This works fine for a simple demonstration but is insufficient for production use because:

  1. You can only run terraform commands on your deployments from that one machine
  2. There are no built-in backups on your state files; if they’re lost, you’re going to have a bad day
  3. There’s no built-in versioning of your state files, which can be handy for recovering from more advanced terraform operations gone wrong

While you could commit the state files to your git repository to distribute them across team members and gain backups and versioning, that isn’t a good idea because:

  1. State files can include sensitive information you might not want in version control
  2. If multiple team members make changes affecting the same state file, you’ll wind up with a mess of multiple incomplete state files needing to be merged

Remote state storage is built into terraform and solves these problems, making it easy for teams to collaborate on the same deployments. Terragrunt provides further enhancements that make working with remote state even easier. Today, we’ll add remote state to the skeleton using terragrunt, S3, and DynamoDB. This will give us a simple foundation to build on.

Goals

  1. Our skeleton stores state remotely such that multiple teammates can run terraform commands
  2. State files are stored in a directory structure matching our deployments
  3. Locks are in place around state manipulation to avoid corruption due to concurrent terraform command executions
  4. State is stored securely, with encryption, versioning, and logging enabled
  5. Remote state configuration is DRY - defined once and reused across deployments

If you prefer to jump to the end, the code implementing the final result is available on GitHub.

Setup

You will need:

  1. An AWS account
  2. Credentials for that account configured in the terminal used for running terraform and terragrunt commands
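
Any of the standard AWS credential mechanisms will do. For example, with environment variables (the values here are placeholders):

export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1

A named profile configured via aws configure and selected with AWS_PROFILE works just as well.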

The S3 Backend

Remote state can be accomplished in many different ways with terraform. The teams I support tend to have an Amazon Web Services footprint already, and therefore I typically use terraform’s S3 backend.

The S3 backend stores your state files in S3 and retrieves them for stateful terraform commands. This meets our distribution, versioning, and encryption requirements. To avoid corruption from concurrent terraform commands, the S3 backend uses a DynamoDB table to manage locks: stateful terraform commands first obtain a lock from the table, effectively single-threading commands that operate on the same state file.

We’ll focus on using the S3 backend today.

Backend Configuration with Terraform

Terraform backends are configured using a block in the stack’s *.tf files:

terraform {
  backend "s3" {
    bucket = "terraform-skeleton-state"      # bucket holding state files
    key    = "something/unique/to/the/stack" # object key for this stack's state
    region = "us-east-1"

    # table enabling the DynamoDB-based locking described above
    dynamodb_table = "terraform-skeleton-state-locks"
  }
}

This inline configuration is less than ideal. First, you have to configure the backend for every stack definition (i.e., in every directory under our modules/stacks), leading to a lot of duplication. Second, you cannot use interpolation in the backend block, so it can’t be configured from variables, locals, or data source attributes. The lack of interpolation is problematic if, for example, you want different buckets for different environments, since multiple deployments (e.g., deployments/app/dev/test-stack, deployments/app/test/test-stack) share the same modules/stacks/app/test-stack files.

Terraform provides several ways to work around these limitations through partial configuration, which allows you to omit properties from the backend block and provide them instead through an out-of-band mechanism such as a configuration file, a command-line option, or an interactive prompt. With partial configuration you could, for example, omit the bucket property from the configuration above and pass it through one of those mechanisms, achieving different buckets for different environments.
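
A sketch of what that might look like (the per-environment bucket name below is illustrative):

# In the stack's *.tf files: bucket omitted from the backend block
terraform {
  backend "s3" {
    key    = "something/unique/to/the/stack"
    region = "us-east-1"
  }
}

The missing property is then supplied at init time:

terraform init -backend-config="bucket=terraform-skeleton-state-dev"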

Partial configuration, although a step in the right direction, still requires you to figure out how you’re going to manage the backend configuration properties. If using configuration files, where will those be stored and distributed to the team? How will you ensure every terraform command receives the necessary command-line properties with the correct values? This more or less forces you to use a build tool for your terraform commands (e.g., a makefile) to achieve consistency/reproducibility. While a build tool isn’t necessarily a bad thing, it strikes me as unusual to require it for remote state, a seemingly basic and fundamental capability.

Fortunately, terragrunt gives us a way to make remote state usage easy for anyone on the team.

Backend Configuration with Terragrunt

With terragrunt, you can configure your backend within a terragrunt.hcl, or in our case root.hcl. Terragrunt then generates the necessary terraform backend configuration based on what you specify. Configuring remote state inside root.hcl means every deployment will receive the backend configuration by way of terragrunt’s include mechanism. This removes all duplication of remote state configuration from the stacks.

Additionally, the remote state block supports interpolation, allowing each deployment to share common parts of the remote state configuration while having other parts be unique. For instance, every deployment can share the same bucket for their state files but use a different prefix within that bucket.

Using terragrunt’s remote state configuration is well documented by Gruntwork. Here’s how we can add it to our skeleton.

Implementation

Add a remote_state block to our root.hcl to generate an s3 backend configuration:

--- a/deployments/root.hcl
+++ b/deployments/root.hcl
@@ -39,6 +40,24 @@ locals {
 # environment variables
 inputs = local.merged_config

+remote_state {
+  backend = "s3"
+  generate = {
+    path      = "backend.tf"
+    if_exists = "overwrite"
+  }
+  config = {
+    bucket  = "terraform-skeleton-state"
+    region  = "us-east-1"
+    encrypt = true
+
+    key = "${dirname(local.relative_deployment_path)}/${local.stack}.tfstate"
+
+    dynamodb_table            = "terraform-skeleton-state-locks"
+    accesslogging_bucket_name = "terraform-skeleton-state-logs"
+  }
+}
+

This approach uses:

  • A single bucket, terraform-skeleton-state, for all deployment state files
  • A key/prefix unique to each deployment based on the relative path from the root.hcl file (worked example after this list)
  • A corresponding DynamoDB lock table, terraform-skeleton-state-locks
  • A logging bucket, terraform-skeleton-state-logs, for logging all S3 access requests to the state bucket
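
For example, for the deployment at deployments/app/dev/test-stack, local.relative_deployment_path is app/dev/test-stack, dirname() trims that to app/dev, and local.stack is test-stack, so the state file lands at the key app/dev/test-stack.tfstate, exactly what the bucket listing further below shows.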

How This Works

Terragrunt executes terraform commands from a .terragrunt-cache directory. Before executing terraform, terragrunt populates the cache with:

  1. Any files in the current deployment directory (location of your terragrunt.hcl file)
  2. The stack files from the directory specified by terraform.source in our root.hcl
  3. Files from generate blocks defined in the HCL files

The last step above is what translates the generate block defined in root.hcl into a backend.tf file for each stack containing the appropriate terraform backend configuration.[2]
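
For our deployments/app/dev/test-stack deployment, the generated backend.tf should look roughly like this (reconstructed from the remote_state config above rather than copied from terragrunt’s output):

terraform {
  backend "s3" {
    bucket         = "terraform-skeleton-state"
    key            = "app/dev/test-stack.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-skeleton-state-locks"
  }
}

Note that accesslogging_bucket_name doesn’t appear here; it’s terragrunt-only configuration used when creating the bucket, not part of terraform’s s3 backend.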

Bucket/Table Initialization

By using the remote_state block in our HCL files, we allow terragrunt to manage the creation of the S3 bucket and DynamoDB table for us.

Terragrunt will:

  • Automatically create the buckets and/or lock table for you if they don’t exist
  • Configure the state bucket (but not the logs bucket) with
    • Versioning and encryption enabled
    • Public access disabled
    • TLS-only access enforced
  • Enable encryption on the lock table
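
You can spot-check these settings afterwards with the AWS CLI, e.g.:

➜ aws s3api get-bucket-versioning --bucket terraform-skeleton-state
{
    "Status": "Enabled"
}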

Activating Remote State

Switching from local state to remote state requires running terragrunt init on each deployment.
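
For example, for one of our deployments:

cd deployments/app/dev/test-stack
terragrunt init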

The first init will prompt you to create the buckets and/or DynamoDB table if they don’t exist:

Initializing remote state for the s3 backend
Remote state S3 bucket terraform-skeleton-state does not
exist or you don't have permissions to access it.
Would you like Terragrunt to create it? (y/n)

Every init will then prompt you to copy your existing local state to the new remote backend:

[terragrunt] 2020/12/13 10:04:56 Initializing remote state for the s3 backend
[terragrunt] 2020/12/13 10:04:57 Running command: terraform init

Initializing the backend...
Do you want to copy existing state to the new backend?
  Pre-existing state was found while migrating the previous "local" backend to the
  newly configured "s3" backend. No existing state was found in the newly
  configured "s3" backend. Do you want to copy this state to the new "s3"
  backend? Enter "yes" to copy and "no" to start with an empty state.

  Enter a value:

Finally, although this shouldn’t be necessary, I’ve had to rm -rf the .terragrunt-cache directory inside each deployment for terragrunt commands to work after the migration.
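
A convenience one-liner for clearing every cache under deployments (assuming the skeleton’s layout) might be:

find deployments -type d -name .terragrunt-cache -prune -exec rm -rf {} +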

After activating remote state for all deployments, we can list our state bucket and see the state files nicely organized:

➜ aws s3 ls --recursive terraform-skeleton-state
2020-12-12 07:38:39       1154 app/dev/test-stack.tfstate
2020-12-13 10:22:49       1148 app/prod/test-stack.tfstate
2020-12-13 10:22:34       1154 app/stage/test-stack.tfstate

Limitations

Using terragrunt’s remote_state block has two big advantages:

  • It’s easy
  • It includes sensible security defaults for the state bucket and lock table

While it is good enough for today’s skeleton, there are some limitations of the terragrunt approach worth covering:

  1. Terragrunt lacks security defaults on the log bucket

    If terragrunt creates the log bucket, it will not have encryption enabled, and it will not have public access explicitly blocked. For this reason, you should strongly consider self-managing the log bucket; I tend to do this with a CloudFormation stack (see the terraform sketch after this list).

  2. Disabling auto-creation of the state bucket and lock table is broken[3]

    For a single team owning all infrastructure, auto-creation is likely safe enough. Once I start distributing ownership of pieces of infrastructure to different teams, I want to control where state files are stored as much as possible, which means avoiding situations where a team accidentally creates a new state bucket or lock table. Consistent state file storage makes it easier to audit infrastructure for compliance with team and organization standards using tools like terraform-compliance. Easier auditing in turn makes it more palatable to push down infrastructure ownership responsibilities.

  3. Terragrunt doesn’t offer full control over the credentials used to access the terraform state

    The remote_state block has a few fields for specifying a role_arn or AWS profile to use for remote state access, but you can’t control things like IAM session tags, transitive tag usage, or assume role policies.

  4. Terragrunt doesn’t offer full control over all fields on the buckets and table

    While terragrunt applies sensible security defaults, you can’t control everything through its remote_state block. For example, you’ll need to self-manage if you want to use specific KMS keys to encrypt the bucket or table. Similarly, if you want to set up replication of your terraform state bucket, self-management is the way to go.
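
Returning to the first limitation: if you’d rather self-manage the log bucket with terraform instead of CloudFormation, a minimal sketch covering the missing security settings might look like the following (resource names are illustrative, and this assumes the v4+ AWS provider resource types):

# Self-managed logs bucket with the settings terragrunt doesn't apply
resource "aws_s3_bucket" "state_logs" {
  bucket = "terraform-skeleton-state-logs"
}

# Default server-side encryption for the log bucket
resource "aws_s3_bucket_server_side_encryption_configuration" "state_logs" {
  bucket = aws_s3_bucket.state_logs.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# Explicitly block all public access
resource "aws_s3_bucket_public_access_block" "state_logs" {
  bucket                  = aws_s3_bucket.state_logs.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}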

What’s next?

The skeleton now supports multiple developers working on the infrastructure as a team, sharing state stored in S3 with contention resolved through a DynamoDB lock table. State is encrypted and versioned, and access to it is logged. Remote state is configured once, in the root.hcl, and reused across our stacks.

Thus far, terragrunt has been running with whatever AWS credentials were configured in the shell at the time of execution.

This isn’t ideal in a team setting because:

  1. It can lead to “works on my machine” problems if developers have different configurations
  2. It encourages all developers to run terraform with administrative-level permissions all the time
  3. Terraform state is sensitive and should be protected from modification except by a well-defined set of roles

The next entry in the terraform skeleton series will take a first step towards addressing these issues.

Footnotes

  1. Terragrunt stores local state in a terraform.tfstate file located underneath the .terragrunt-cache directory it creates within our deployment directory. 

  2. The terragrunt remote state approach we cover here uses the newer generate option, which was added in terragrunt v0.22.0. Before that version, terragrunt wouldn’t generate a backend.tf file but would instead pass CLI arguments to the terraform command, making use of terraform’s partial configuration implementation. The generate approach has several advantages over the old method. Most notably, you previously had to include an empty backend block (terraform { backend "s3" {} }) in every stack, whereas with the generate approach you do not. 

  3. Terragrunt has a disable_init attribute on the remote_state block, which prevents auto-creation but, as described in an open terragrunt issue, also completely disables backend initialization.