How to Use Terraform at Scale for Federated IT Systems

Deeptesh Bhattacharya
13 min read · May 14, 2021

Introduction to Terraform

Terraform is an open-source infrastructure-as-code tool that provides a consistent approach and tooling to manage cloud services. Terraform codifies cloud APIs into declarative configuration files, enabling individuals and teams to provision these cloud services easily and to manage the state of their infrastructure.

Using this tool, organisations can keep track of the changes made to their infrastructure and roll back to a previous state at any point in time using any of the supported version control systems (VCS).

With Terraform, a developer can not only write infrastructure as code using declarative files, but can also review configuration changes before any services are provisioned and verify that the planned changes match the expectations of the overall workload.

Terraform also enables state management for easy rollbacks in case anything goes wrong with a change, and, through the use of backends and VCS, can form part of your disaster recovery and remediation plan.

Terraform is highly extensible and supports 500+ providers for cloud deployments.

A SIMPLE TERRAFORM WORKFLOW

A developer starts by identifying the requirements for the infrastructure and, based on the business needs, writes the infrastructure code in Terraform (.tf) files. They can then use the commands terraform init, terraform plan and terraform apply to review the plan and provision the infrastructure on the cloud.
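As a minimal, hypothetical sketch of this workflow (the provider, AMI ID and instance type are placeholders, not from the original article):

Example Usage

# main.tf: a minimal configuration to exercise the workflow
provider "aws" {
  region = "eu-west-1"
}

resource "aws_instance" "web" {
  ami           = "ami-0abcdef1234567890" # placeholder AMI ID
  instance_type = "t3.micro"
}

# terraform init  : download providers and initialise the backend
# terraform plan  : preview the changes before provisioning
# terraform apply : provision the infrastructure on the cloud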

So far this looks good when we are working with very small teams and at a small scale. Resources keep getting added to the core infrastructure and are managed properly with Terraform state files, remote backends and VCS to track any changes.

This is easy to maintain and manage, but there is still scope for improvement to avoid technical debt, by using Terraform modules and decomposing the overall architecture.

In general, this is what a root directory might look like when provisioning 2 application stacks on 3 environments (QA, Staging & Prod) operating under a single VPC. This is close to some 400 lines of code. Think of what happens when it scales up to 100–200 servers being managed at once.
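A rough, hypothetical sketch of such a monolithic root directory (the file names are illustrative):

main.tf            # VPC, subnets, routing and shared networking
app1-qa.tf         # application stack 1, per environment
app1-staging.tf
app1-prod.tf
app2-qa.tf         # application stack 2, per environment
app2-staging.tf
app2-prod.tf
variables.tf
outputs.tf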

When your infra needs are smaller, 2 people working on infrastructure code is manageable. However, as the production scale moves up, managing all this infra starts becoming chaotic. One small change to any critical infra resource can lead to the failure of an existing workload and result in business or financial loss.

A FEW PROBLEMS WHEN SCALING UP

At scale, adding more people, projects and environments amplifies the problems that come with scalability:

  1. Technical Debt
  2. Multi-Tenancy, Visibility and Traceability
  3. Code Reusability
  4. Knowledge Silos
  5. Risks of failure of existing infrastructure
  6. Recovery from Failure
  7. Diminishing Trust

As the organisation scales and more people develop and manage your infrastructure, you should not allow everyone to make changes independently. It becomes ever more important to control the changes to your infrastructure.

Let us find out how we can address these problems of scale by using the Terraform-at-scale framework.

This framework is primarily the work of Armon Dadgar (co-founder and CTO of HashiCorp), an exemplary thought leader, who has explained it beautifully in one of his walkthroughs on YouTube.

Let us see how Terraform and Terraform Cloud can help us manage all these scaling problems and achieve Terraform at scale.

THE PROBLEM OF TECHNICAL DEBT WITH A MONOLITHIC CODE BASE

To overcome this, the first principle is to treat your infrastructure code like any application code: think of your infrastructure stack as an application stack, decompose it into modular components, and expose only what is necessary.

As Armon suggests in his walkthrough, you should start decomposing your infrastructure code into core services, middleware services that can be used as shared resources, and finally the application stack. Your infrastructure code stack should look something like the stack below.
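Conceptually, the layers stack up roughly as follows, with each layer consuming the outputs of the layer below it:

Application Stack      (application servers, jobs, app-specific resources)
Middleware Services    (shared databases, queues, caches, Kubernetes clusters)
Core Services          (VPCs, subnets, DNS, IAM, security baselines)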

It is advised to break down large monolithic Terraform configurations into smaller ones, then assign each one to its own workspace and delegate permissions and responsibilities for them to the right users as per their roles.

You can use Terraform Cloud Workspaces to decompose your stack and manage it more effectively. Workspaces are collections of infrastructure that allow your teams to set up multiple distinct sets of infrastructure and manage them separately. Users are defined and permissions can be granted for modular control over the workspaces.

Terraform Cloud workspaces and local working directories serve the same purpose, but they store their data differently:

  • Terraform configuration: on disk locally; in a linked version control repository (or periodically uploaded via the API/CLI) on Terraform Cloud.
  • Variable values: .tfvars files, CLI arguments or the shell environment locally; stored in the workspace on Terraform Cloud.
  • State: on disk or in a remote backend locally; stored in the workspace on Terraform Cloud.
  • Credentials and secrets: the shell environment or interactive prompts locally; stored in the workspace as sensitive variables on Terraform Cloud.
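As a minimal sketch, a configuration can be attached to a Terraform Cloud workspace through the remote backend (the organisation and workspace names below are hypothetical):

Example Usage

terraform {
  backend "remote" {
    hostname     = "app.terraform.io"
    organization = "example-corp"

    workspaces {
      name = "net-prod-eu-west-1"
    }
  }
}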

Key Takeaways:

Depending on your organisation structure and workflow:

  1. Since the Terraform CLI uses content from the directory it runs in, you can organise infrastructure resources into meaningful groups by keeping their configurations in separate directories.
  2. Break down your infrastructure into core modules and service modules.
  3. Keep as many of the key configurations as possible in variables.
  4. Further, you can store these directories as separate repositories in your VCS, with an N:1 relationship between resources and repositories; for example, your core network can live in a repo called tf-net, inside which a VPC module gives you the ability to create multiple VPCs as required.
  5. A good naming strategy to start with is <COMPONENT>-<ENVIRONMENT>-<REGION>.

A real-world example might look something like the following, where each folder is a separate repository in your VCS and a separate workspace on Terraform Cloud:
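For instance, following the <COMPONENT>-<ENVIRONMENT>-<REGION> strategy, a hypothetical set of repositories and workspaces could be:

tf-net-prod-eu-west-1          # core networking: VPC, subnets, routing
tf-net-qa-eu-west-1
tf-middleware-prod-eu-west-1   # shared services: databases, queues, caches
tf-app1-qa-eu-west-1
tf-app1-prod-eu-west-1
tf-app2-prod-eu-west-1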

Note: Workspaces managed with the terraform workspace CLI command aren't the same thing as Terraform Cloud's workspaces.

PROBLEMS WITH MULTI-TENANCY, RE-USABILITY, VISIBILITY, TRACEABILITY AND KNOWLEDGE SILOS

Multi-tenancy refers to a software architecture in which a single instance of an application serves multiple customers. In the case of Terraform, if you start coding your infrastructure without a granular plan for extending the code to other projects, the result is a monolithic repository in which the problem of technical debt sooner or later starts to creep in.

Terraform uses variables and modules to define the overall architecture, which helps you write your code in a way that it can be reused to deploy similar resources across different tenants.
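As a minimal sketch, a hypothetical tenant variable can parameterise a module call so the same code serves several tenants (the local module path and names are illustrative):

Example Usage

variable "tenant" {
  description = "Name of the tenant this stack is deployed for"
  type        = string
}

variable "environment" {
  description = "Deployment environment (qa, staging, prod)"
  type        = string
  default     = "qa"
}

module "tenant_network" {
  source = "./modules/network" # hypothetical local module

  name = "${var.tenant}-${var.environment}-vpc"
  cidr = "10.0.0.0/16"
}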

You can use published modules from the Terraform Registry directly while decomposing your infrastructure code.

For resources on AWS you can use the modules and examples provided by Anton Babenko, and include them as part of your core modules if AWS is your core cloud provider — https://github.com/terraform-aws-modules.

Anton is a long-time developer and CTO of Betajob AS in Norway, and since 2015 has helped companies around the globe build solutions using AWS, specialising in infrastructure as code, DevOps and reusable infrastructure components.

Anton is the maintainer of several Terraform/AWS modules and related projects. Don't miss his technical blog, https://www.antonbabenko.com, for some great information on DevOps, AWS and Terraform automation.

Example Usage

module "vpc" {
source = "terraform-aws-modules/vpc/aws"
name = "my-vpc"
cidr = "10.0.0.0/16"
azs = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
enable_nat_gateway = true
enable_vpn_gateway = true
tags = {
Terraform = "true"
Environment = "dev"
}
}

You can also set these variables in Terraform Cloud. For each configured workspace, Terraform Cloud can set values for two kinds of variables: Terraform input variables and environment variables.
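For example, an input variable declared without a default can have its value supplied by the workspace rather than by a local .tfvars file (the variable name here is illustrative):

variable "instance_count" {
  description = "Number of instances; value supplied as a Terraform variable in the Terraform Cloud workspace"
  type        = number
}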

Workspaces also give you clear visibility of your current run states, the latest changes, and the repo each workspace connects to. With a connected VCS repository, Terraform Cloud can automatically fetch content from supported VCS providers and uses webhooks to be notified of code changes, so the entire process is automated; alternatively, you can configure Terraform Cloud as your remote backend and push your changes to the cloud yourself.

Moreover, if you are decomposing your infrastructure along the structure above, it is preferable to use Terraform Cloud's private module registry, which helps you share Terraform modules across your organisation. Different developers can then include these modules as part of their infrastructure code, reusing a capability once it has been developed and avoiding repetition.

It not only supports module versioning and a searchable, filterable list of available modules, but also provides a configuration designer to help you build new workspaces faster.

Note: The public registry uses a three-part <NAMESPACE>/<MODULE NAME>/<PROVIDER> format, and private modules use a four-part <HOSTNAME>/<ORGANIZATION>/<MODULE NAME>/<PROVIDER> format.

Example Usage

module "vpc" {
source = "app.terraform.io/example_corp/vpc/aws"
version = "1.0.4"
}

Modules can further be used to classify and standardise the configurations required by a similar group of applications, and can be consumed as a black box. After adopting Terraform modules and the module registry, your overall infrastructure stack will look like this:

This will enable developers who might not be very conversant with cloud infrastructure definitions to write the configurations required for their applications. They can use the inputs and outputs of these modules to define and provision the infra required for their needs.

If you look at the diagram, the registry stores different modules as pre-approved standards for deploying similar kinds of applications. A team that wants to deploy a Java app can quickly call the corresponding module and set the exposed input variables (number of instances, region, application JAR name) to provision the infrastructure. They don't have to know what the rest of the module does in the background. The module can create the required count of instances, key pairs, route tables, load balancers, DNS routes and security group rules for deploying the Java application. The team can then plug the output into the K8sCluster module to get their application deployed on a Kubernetes cluster.

We can expose the module registry for a common set of technologies to onboard new application teams quickly, provisioning the required resources through automation as part of managing and deploying the architecture.

Modules in the registry are ready-made, reusable infrastructure packages that non-experts in Terraform can use to provision pre-approved infrastructure themselves. Producers can standardise application deployments by publishing such modules, and consumers can use them directly to provision infrastructure without needing to understand the underlying resources, which remain a black box to them.

Example Usage

module "javapps" {
source = "app.terraform.io/example_corp/javaapps"
name = "tour-of-heroes"
required_instance_count = 3
appjarname = "tour_of_heroes.jar"
tags = merge(
var.default_tags,
{
Name = "MyHeroApp"
},
)
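For the consuming team, the module's interface is all that matters. A hypothetical variables.tf and outputs.tf inside such a module (the names are illustrative, not from an actual registry module) might look like:

Example Usage

variable "name" {
  description = "Application name"
  type        = string
}

variable "required_instance_count" {
  description = "Number of application instances to create"
  type        = number
  default     = 1
}

variable "appjarname" {
  description = "File name of the application JAR to deploy"
  type        = string
}

output "load_balancer_dns" {
  description = "DNS name of the load balancer fronting the application"
  value       = aws_lb.app.dns_name # load balancer defined elsewhere inside the module
}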

Key Takeaways:

  1. Break your infrastructure stacks into smaller components which are easier to manage and control.
  2. Make extensive use of variables, functions, expressions and data blocks when defining your infrastructure, keeping it flexible for reuse.
  3. Use the private module registry of Terraform Cloud to expose all modules to operational and other users for code reusability.
  4. Track changes across runs and save state information in a remote backend for traceability.
  5. Develop your modules for similar categories and groups so that even non-Terraform or non-infrastructure experts can understand your code and use it.
  6. The idea is to create a self-service module for your infrastructure.

Note: Terraform can be integrated with ServiceNow and Splunk for ticket-based provisioning and remote monitoring.

MANAGING THE RISK OF FAILURE, RECOVERING FROM FAILURE, AND ADDRESSING DIMINISHING TRUST WITH STRONGER CONTROLS

Another challenge with scaling up is diminishing trust and an increased risk of failure while deploying new infrastructure, similar to the issues we face with any application deployment.

We can solve that problem by implementing RBAC and code review flows, just as we do for our applications.

All repositories can be configured to be used and modified only by specific groups of people, based on role-based access to the repositories. All code commits can further be reviewed using pull requests for final approval, or tested against pre-defined policies so that the final review is less cumbersome for the architect.
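Terraform Cloud's own teams and workspace permissions can themselves be managed as code. A minimal sketch using the tfe provider (the team, workspace and organisation names are hypothetical):

Example Usage

resource "tfe_team" "network_ops" {
  name         = "network-ops"
  organization = "example-corp"
}

resource "tfe_workspace" "network_prod" {
  name         = "network-prod-eu-west-1"
  organization = "example-corp"
}

# Grant the team write access to the production network workspace only
resource "tfe_team_access" "network_ops_prod" {
  access       = "write"
  team_id      = tfe_team.network_ops.id
  workspace_id = tfe_workspace.network_prod.id
}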

CloudOps and DevOps architects can define these policies at the organisation, workspace and module levels. Terraform Cloud can further simplify and automate this code review process using a policy-as-code framework such as Sentinel.

Sentinel Policy Framework Workflow

Policy as code is codified logic for reviewing new infrastructure templates or modules, ensuring compliance and system safety with fast feedback and eliminating the wait for a manual code review. Terraform can use the Sentinel framework, through which a policy is enforced on every change, and Sentinel decides whether the change is allowed based on the policy definition.

Sentinel is an embedded policy-as-code framework integrated with the HashiCorp Enterprise products. It enables fine-grained, logic-based policy decisions, and can be extended to use information from external sources.

Using Sentinel with Terraform Cloud involves:

  • Defining the policies — Policies are defined using the policy language with imports for parsing the Terraform plan, state and configuration.
  • Managing policies for organizations — Users with permission to manage policies can add policies to their organization by configuring VCS integration or uploading policy sets through the API. They also define which workspaces the policy sets are checked against during runs.
  • Enforcing policy checks on runs — Policies are checked when a run is performed, after the terraform plan but before it can be confirmed or the terraform apply is executed.
  • Mocking Sentinel Terraform data — Terraform Cloud provides the ability to generate mock data for any run within a workspace. This data can be used with the Sentinel CLI to test policies before deployment.

Policies are managed as parts of versioned policy sets, which allow individual policy files to be stored in a supported VCS provider or uploaded via the Terraform Cloud API.

Policy sets are groups of policies that can be enforced on workspaces. A policy set can be enforced on designated workspaces, or to all workspaces in the organisation.

After the plan stage of a Terraform run, Terraform Cloud checks every Sentinel policy that should be enforced on the run’s workspace. This includes policies from global policy sets, and from any policy sets that are explicitly assigned to the workspace, which enables you to define global policies for all infrastructure and specific policies to a specific workload configuration.

All policies are further governed by enforcement levels. Enforcement levels in Sentinel define the behaviour when a policy fails to evaluate successfully. Sentinel provides three enforcement modes:

  • hard-mandatory requires that the policy passes. If a policy fails, the run is halted and may not be applied until the failure is resolved.
  • soft-mandatory is much like hard-mandatory, but allows any user with the Manage Policy Overrides permission to override policy failures on a case-by-case basis.
  • advisory will never interrupt the run, and instead will only surface policy failures as informational to the user.

Example Usage

policy "terraform-maintenance-windows" {
source = "./terraform-maintenance-windows.sentinel"
enforcement_level = "hard-mandatory"
}

Sentinel policies themselves are defined in individual files (one per policy) in the same directory as the sentinel.hcl file and can be further managed using the policy set section of the organisation.

Example Usage

The policy below makes sure S3 buckets have tags attached. If so, the policy passes. If any S3 bucket found does not have a tag, the policy fails.

import "tfplan/v2" as tfplans3_buckets = filter tfplan.resource_changes as _, rc {rc.type is "aws_s3_bucket" and(rc.change.actions contains "create" or rc.change.actions is ["update"]) }bucket_tags = rule {all s3_buckets as _, instances {instances.change.after.tags is not null     } }main = rule {bucket_tags }

Using policy enforcement, you can ensure that any developer submitting new code is constrained to infrastructure that is governed and regulated by the defined policies.

Key Takeaways:

  1. Restrict your code changes through automated code review using the Sentinel framework.
  2. Ease self-service provisioning capabilities.
  3. Inject secrets into Terraform using Vault (a minimal sketch follows this list).
  4. Enable RBAC to authorise changes based on roles.
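As a minimal sketch of the Vault integration (the Vault address and secret path are hypothetical), the Vault provider can read secrets at plan time so they never live in your Terraform code:

Example Usage

provider "vault" {
  address = "https://vault.example.com:8200" # hypothetical Vault server
}

# Read database credentials from Vault instead of hard-coding them
data "vault_generic_secret" "db" {
  path = "secret/app/db" # hypothetical secret path
}

locals {
  db_username = data.vault_generic_secret.db.data["username"]
  db_password = data.vault_generic_secret.db.data["password"]
}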

CONCLUSION:

Finally, to summarise, moving from managing your Terraform code locally to using Terraform Cloud can enable:

  • Code compliance through policy enforcement.
  • A modular framework to break down and manage a monolithic repo.
  • Less-expert users provisioning resources on their own.
  • Built-in security that keeps infrastructure safe with RBAC and policy enforcement.
  • Governance of the associated risks of failure.
  • Consistency and reusability of code through modularity.
  • Cost control through policy enforcement.

I hope this overall analysis helps you manage your Terraform infrastructure at scale and utilise some of the best tools provided by HashiCorp to do so. And yes, if you found this informative and relevant, please follow, like and share.

I would like to thank Armon Dadgar and Anton Babenko for their exemplary thought leadership, and for inspiring me through their knowledge sharing to pursue this research and evaluate the capabilities provided by Terraform.

Deeptesh Bhattacharya

A DevOps practitioner looking for Sponsored Visa Opportunities in Europe.