Multi-stage setup with Terraform

Today we are going to extract a Terraform module from our setup in order to run multiple stages, such as development, staging, and production.
In the previous articles of this series, I described a simple setup of a web app on AWS: AWS Lambda for some backend functionality, S3 to store static website files, CloudFront as a caching router, and Hetzner Cloud DNS to publish the service under a domain. For a hobby project, this is enough, but when we develop software professionally, we usually have multiple stages of the complete system running in parallel.
At the very least, we have two stages:
- “production” (short “prod”) is the system used by the actual users.
- “staging” contains different instances of the same components: the same code, the same database schema, but generally different data (i.e., test data). This stage exists so that we can test and possibly break things without affecting the actual users of the system. Often, we run a set of end-to-end tests against staging, which test the system before everything gets deployed to production.
- I worked on a project where “staging” was used by dedicated testers to click through test-plans. To stop developers from interfering with testers, we deployed another “development” stage (short “dev”) as a developer testing-ground.
- In another project, we were using “review” environments. We were using GitLab and Merge Requests (MR) in our workflow; the CI/CD pipeline would deploy a dedicated clone of the application for every MR.
So far, my “placeruler” app has just a single stage: production.
Let’s change that.
Terraform modules
Terraform modules are a way to reuse Terraform code in different places, and in the end, deploying multiple stages just means reusing the code of the whole application multiple times. So our goal here is to extract a module that deploys the whole application, which we can then use with different parameters.
How do modules work in Terraform?
We can create a directory with some Terraform files in our project and call that module with a module declaration in the main project.
- Inputs: If we want to pass data into the module, we can use a variable statement inside the module.
- Outputs: If we want to return data from the module, we can use an output statement.
Note that we’ve used both statements before in previous blog posts. We used variables to pass in an access token via a hetzner.auto.tfvars file, which we didn’t want to add to git. We used an output to show the actual function URL of a deployed Lambda, because we wanted to test this part without applying DNS changes. We also used an output for the bucket name of the website files, so that we could read it from the script that actually copied the files to the bucket.
A module is simply a directory with terraform configurations. We can pass in variables and retrieve outputs. For example, consider this file structure:
├── main.tf
└── trivial-module
    └── main.tf
The file trivial-module/main.tf defines a variable and returns that variable with a prefix:
variable "some-variable" {
type = string
}
output "some-return-value" {
value = "The return of ${var.some-variable}"
}
The root module (i.e., /main.tf) calls this module, passes in the variable, and outputs the return value (which will print it to the console):
module "trivial-module" {
source = "./trivial-module"
some-variable = "the king"
}
output "module-return-value" {
value = module.trivial-module.some-return-value
}
Running terraform apply in the root directory will output “The return of the king” to the console.
So, to create a module from our app, we just move it to a directory?
Basically, yes. But we should follow some best practices. So far, we had the following file structure:
.
├── certificate.tf
├── cloudfront.tf
├── hetzner-dns.tf
├── lambda.tf
├── providers.tf
└── s3-website-files.tf
The variables and outputs are distributed across these files, which makes it difficult to find them all.
The Terraform documentation provides this basic structure for a module:
├── LICENSE
├── README.md
├── main.tf
├── variables.tf
└── outputs.tf
The modules in terraform-aws-modules also have a versions.tf file containing the required provider versions.
I would not put all the Terraform code into a single main.tf, but having separate variables.tf, outputs.tf, and versions.tf files certainly makes sense.
Splitting the code into even more modules would also make sense, but not in this post.
Basic structure of the multi-stage setup
This is the structure we’ll be using from now on:
.
├── modules
│   └── lambda-with-website
│       ├── certificate.tf
│       ├── cloudfront.tf
│       ├── hetzner-dns.tf
│       ├── lambda.tf
│       ├── outputs.tf
│       ├── s3-website-files.tf
│       ├── variables.tf
│       └── versions.tf
├── README.md
└── stages
    └── prod
        ├── hetzner.auto.tfvars
        ├── main.tf
        ├── providers.tf
        └── variables.tf
In the file stages/prod/main.tf we call the module modules/lambda-with-website. Let’s consider some more aspects of this migration:
Variables and Outputs
Which inputs and outputs do we actually need? Which resources actually differ from stage to stage?
- Variable domain_name: We certainly want to deploy “staging” on a different domain than “production”.
- Variable dns_zone: Maybe we don’t want to deploy staging to a subdomain of “knappi.org” at all.
- Variable name_prefix: If we use a different AWS account for each stage, we don’t need this input. But in the projects I worked on, we usually had only two accounts: one for production and one for non-production deployments. That means all resources for staging, dev, and review environments live in one account. To prevent duplicate resource names, we need to add some kind of prefix, like staging- or dev-.
- Output s3_bucket_name: We still need to copy our files to the created bucket. If we want to have the source of truth in our Terraform code, we need to have this as an output. However, there are cases where you might want to use a naming convention instead; for example, if the frontend code lives in a different repository. In that case, we don’t need this output (see the sketch below).
Providers
The Terraform documentation states that a reusable module shouldn’t have provider definitions. Instead, we’re only allowed to specify the terraform.required_providers configuration.
The aws.us_east_1 provider alias is a bit tricky here, because it requires special configuration. For reusable modules, this is done via the configuration_aliases property:
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
configuration_aliases = [aws.us_east_1]
}
archive = {
source = "hashicorp/archive"
version = "~> 2.0"
}
hetznerdns = {
source = "timohirt/hetznerdns"
version = "2.1.0"
}
}
required_version = ">= 0.13"
}
In my current version of the “Terraform and HCL” plugin (243.23654.44), the configuration_aliases property was not supported. This is a bit frustrating, because I get no autocompletion and the aws.us_east_1 line is always shown as an error. I hope this will get better soon.
In our root module, we need to define the correct providers. The aws.us_east_1 provider must be passed in explicitly, because it is a provider with an alias.
In stages/prod/providers.tf:
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
archive = {
source = "hashicorp/archive"
version = "~> 2.0"
}
hetznerdns = {
source = "timohirt/hetznerdns"
version = "2.1.0"
}
}
required_version = ">= 0.13"
}
# Configure the AWS Provider
provider "aws" {
region = "eu-west-1"
}
# Configure the AWS Provider
provider "aws" {
alias = "us_east_1"
region = "us-east-1"
}
# Configure the Hetzner DNS Provider
provider "hetznerdns" {
apitoken = var.hetzner_dns_token
}
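The var.hetzner_dns_token referenced above still has to be declared in the stage itself; its value comes from hetzner.auto.tfvars. A minimal sketch of stages/prod/variables.tf (the description text is my addition):
# stages/prod/variables.tf (sketch)
variable "hetzner_dns_token" {
  description = "API token for the Hetzner DNS provider, set in hetzner.auto.tfvars (not committed to git)"
  type        = string
}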
In stages/prod/main.tf:
module "lambda-with-website" {
source = "../../modules/lambda-with-website"
dns_zone = "knappi.org"
domain_name = "lambda-example.knappi.org"
name_prefix = ""
providers = {
aws.us_east_1 = aws.us_east_1
}
}
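A second stage would then just be another call to the same module with different parameters. The following stages/staging/main.tf is a hypothetical sketch; the directory, domain, and prefix values are assumptions, not part of the actual repository:
# stages/staging/main.tf (hypothetical sketch)
module "lambda-with-website" {
  source = "../../modules/lambda-with-website"

  dns_zone    = "knappi.org"
  domain_name = "staging.lambda-example.knappi.org" # assumed staging domain
  name_prefix = "staging-"                          # prefix to avoid clashes with prod resources

  providers = {
    aws.us_east_1 = aws.us_east_1
  }
}
The staging directory would also need its own providers.tf, variables.tf, hetzner.auto.tfvars, and Terraform state, analogous to stages/prod.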
Migrations
So, we’ve extracted our module, we’ve copied the Terraform state to stages/prod, and now we run terraform apply -auto-approve …
… and see with tears in our eyes that
- resources like the Lambda and Cloudfront have been removed,
- the deployment was canceled midway because the S3 bucket couldn’t be deleted. It still contained some files.
But why would anything be removed in the first place? We didn’t want to create new AWS resources. We just wanted to refactor our Terraform definitions.
To Terraform, the resource for the S3 bucket is called aws_s3_bucket.static-website, and it is stored in the tfstate under that name. Moving the definition into the module changed that name: it is now called module.lambda-with-website.aws_s3_bucket.static-website.
Since Terraform doesn’t know about our refactoring, it has to assume that the old resource has to go and a new one has to be created.
In our example app, this is no problem. No one is using it anyway, and we don’t have any relevant data that is not also part of our repository. But for the common application with user data and uploaded files, we need to have a way to keep our resources intact.
Moved
Let’s pretend we didn’t apply those changes but rather had a look at the plan first. We could have solved the problem by using a moved statement:
moved {
from = aws_s3_bucket.static-website
to = module.lambda-with-website.aws_s3_bucket.static-website
}
This tells Terraform that we moved the S3 bucket to the module. It can keep the resource instead of re-creating it.
I’m not an expert in Terraform. I’m just learning it. But I have a lot of general experience in software development. The following statement may not be true, but I think it is highly likely.
If we want to ensure a zero-downtime deployment for this migration, I would assume that we need to do more than keep the S3 bucket. We need to make sure CloudFront is not deleted, and it would also be a good idea to keep the Lambda around.
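As a sketch, assuming the distribution and the function are defined in the module as aws_cloudfront_distribution.distribution and aws_lambda_function.backend (hypothetical resource names; the real addresses are whatever terraform plan reports as destroyed), the additional moved blocks would look like this:
# Hypothetical resource addresses; use the addresses from your own plan output.
moved {
  from = aws_cloudfront_distribution.distribution
  to   = module.lambda-with-website.aws_cloudfront_distribution.distribution
}

moved {
  from = aws_lambda_function.backend
  to   = module.lambda-with-website.aws_lambda_function.backend
}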
I tried adding moved statements manually, running terraform plan over and over again. It was slow and frustrating. Then I ran:
bin/plan.sh prod | grep destroyed
and did some regex-replace magic, replacing
# (\S+) will be destroyed
with
moved {
from = $1
to = module.lambda-with-website.$1
}
and saved everything as migration-001.tf.
I’m sure this would have worked if I had not made other changes as well, such as fixing the name of the Lambda. In the end, I’m not sure whether the deployment happened without downtime or not. But I’m sure of my conclusion.
Conclusion
Extracting the whole application as a module is a viable way to do multi-stage deployments. It is cumbersome to do this while the system is already running, without causing downtime.
If the S3 bucket is called website-static, it can’t be renamed to prod-website-static easily.
It would have been much easier to start with a single application module, a single stage, and correct names from the beginning.
The current state of the example project can be found in branch 0035-multi-stage-setup. If you want to try the migration yourself:
- Check out 0034-cloudfront-s3-multi-origin.
- Set up accounts and access tokens as described in the previous articles (set up hetzner.auto.tfvars and your AWS account).
- Run bin/deploy.sh to deploy everything.
- Check out 0035-multi-stage-setup and copy terraform.tfstate and hetzner.auto.tfvars to stages/prod.
- Remove stages/prod/migration-001-move-to-module.tf.
- Run bin/plan.sh prod and look at the plan. Everything is removed and created anew.
- Recreate stages/prod/migration-001-move-to-module.tf from the git history.
- Run bin/plan.sh prod and look at the plan again. Some things are replaced, but most of them are retained.
- Run bin/deploy.sh prod to deploy everything.
My other conclusion is that Terraform is great and I really like it. I can extract functions (i.e., modules) and keep my code clean that way. That’s all I need…