
Building Block by Block

26 Aug 2024

This is the first in a series of blog posts documenting the journey of building infrastructure from scratch. This post will focus on provisioning infrastructure on AWS using OpenTofu, an open-source alternative to Terraform. For those unfamiliar with infrastructure as code (IaC), OpenTofu allows us to define and manage our cloud resources using declarative configuration files. In future posts, we’ll explore configuring servers with Ansible and setting up Docker Swarm for container orchestration.

If you’re itching to see how all this comes together, you can check out the full code here. Feel free to poke around and maybe even use it as inspiration for your own projects.

The Big Picture

Before diving into the details, let’s outline the overall architecture we’re aiming for. The plan involves using Docker Swarm, a native clustering and scheduling tool for Docker. This setup requires two types of nodes:

  1. Manager Nodes: These will handle incoming requests and manage the cluster. They need to be public-facing and accessible from the internet.
  2. Worker Nodes: These will run our actual services. They’ll be placed in a private network for enhanced security. While they can reach out to the internet (e.g., to pull Docker images), they won’t be directly accessible from outside our network.

This separation of concerns enhances security by minimizing the attack surface exposed to the public internet.

VPC

Our first step is to set up a Virtual Private Cloud (VPC) on AWS. Think of a VPC as your own private section of the AWS cloud. Within this VPC, we’ll create two subnets:

  1. A public subnet for our manager nodes
  2. A private subnet for our worker nodes

Here’s the code to set this up:

vpc.tf
resource "aws_vpc" "default" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_internet_gateway" "default" {
  vpc_id = aws_vpc.default.id
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.default.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.default.id
  }
}

resource "aws_subnet" "public" {
  cidr_block = cidrsubnet(aws_vpc.default.cidr_block, 8, 10)
  vpc_id     = aws_vpc.default.id
}

resource "aws_route_table_association" "manager" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}

data "aws_eip" "worker" {
  id = var.worker_eip_id
}

resource "aws_nat_gateway" "nat" {
  allocation_id = data.aws_eip.worker.id
  subnet_id     = aws_subnet.public.id
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.default.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat.id
  }
}

resource "aws_subnet" "private" {
  cidr_block = cidrsubnet(aws_vpc.default.cidr_block, 8, 20)
  vpc_id     = aws_vpc.default.id
}

resource "aws_route_table_association" "worker" {
  subnet_id      = aws_subnet.private.id
  route_table_id = aws_route_table.private.id
}

Let’s break down the key points of this configuration:

  • We’re creating a VPC with a CIDR block of 10.0.0.0/16, which gives us a range of 65,536 IP addresses to work with.
  • Two subnets are defined within this VPC:
    • The public subnet uses IPs in the range 10.0.10.x.
    • The private subnet uses IPs in the range 10.0.20.x.
  • An Internet Gateway is attached to the VPC. This is essential for allowing our public subnet to communicate with the internet.
  • A NAT Gateway is set up in the public subnet. This clever piece of networking allows our private subnet instances to initiate outbound traffic to the internet, while still preventing inbound connections from the internet.
  • We’re attaching an existing static IP (an Elastic IP, looked up via var.worker_eip_id) to the NAT Gateway. This is particularly useful if you need to whitelist your outgoing IP with any third-party services.

The routing tables ensure that traffic is directed appropriately:

  • The public subnet’s route table sends internet-bound traffic to the Internet Gateway.
  • The private subnet’s route table sends internet-bound traffic to the NAT Gateway.
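
One housekeeping note: the snippets in this post reference a handful of input variables (var.name, var.worker_eip_id, and so on) and assume an AWS provider is configured. Here's a rough sketch of what that could look like; the region, instance types, descriptions, and defaults are placeholders, and the Elastic IPs themselves are assumed to be allocated outside this configuration and passed in by ID:

provider "aws" {
  region = "eu-west-1" # placeholder region
}

variable "name" {
  description = "Prefix used to name resources"
  type        = string
}

variable "manager_eip_id" {
  description = "ID of an existing Elastic IP for the manager node"
  type        = string
}

variable "worker_eip_id" {
  description = "ID of an existing Elastic IP for the NAT Gateway"
  type        = string
}

variable "manager_type" {
  description = "Instance type for manager nodes"
  type        = string
  default     = "t3.small" # placeholder
}

variable "worker_type" {
  description = "Instance type for worker nodes"
  type        = string
  default     = "t3.small" # placeholder
}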

Defining Security Groups

With our network structure in place, the next step is to define our security groups. In AWS, security groups act as a virtual firewall for your instances to control inbound and outbound traffic. Here’s how we’re setting them up:

vpc.tf
...

resource "aws_security_group" "manager_node" {
  name   = "${var.name}-manager-node"
  vpc_id = aws_vpc.default.id

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port        = 80
    to_port          = 80
    protocol         = "tcp"
    cidr_blocks      = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }

  ingress {
    from_port        = 443
    to_port          = 443
    protocol         = "tcp"
    cidr_blocks      = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }

  ingress {
    from_port        = 443
    to_port          = 443
    protocol         = "udp"
    cidr_blocks      = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }

  egress {
    from_port        = 0
    to_port          = 0
    protocol         = "-1"
    cidr_blocks      = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }
}

resource "aws_security_group" "worker_node" {
  name   = "${var.name}-worker-node"
  vpc_id = aws_vpc.default.id

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [aws_subnet.public.cidr_block]
  }
}

Let’s examine the security group configurations.

For Manager Nodes:

  • Incoming SSH access (port 22) is allowed from anywhere. In a production environment, you might want to restrict this to specific IP ranges.
  • HTTP and HTTPS traffic (ports 80 and 443) are open to the world, as these nodes will be handling incoming web requests.
  • All outgoing traffic is allowed, which is necessary for updates, downloading Docker images, etc.

For Worker Nodes:

  • The only incoming traffic allowed is SSH, and only from the public subnet. This means you’ll need to SSH into a manager node first, then hop to a worker node.
  • Outgoing traffic rules aren’t defined here. Keep in mind that when OpenTofu manages a security group, it removes AWS’s default allow-all egress rule, so as written these nodes can’t initiate any outbound connections; add egress rules to match your needs (see the sketch below).
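
For example, letting the workers reach the internet through the NAT Gateway (to pull Docker images, install packages, and so on) could be done by adding an allow-all egress block inside the worker_node security group, mirroring the manager’s:

  egress {
    from_port        = 0
    to_port          = 0
    protocol         = "-1"
    cidr_blocks      = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }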

This setup provides a good balance of accessibility and security. Manager nodes are reachable for web traffic and administration, while worker nodes are tucked away in the private subnet, shielded from direct internet access.

Key

We’ll generate a 4096-bit RSA key pair. This key will be used later when creating our AWS instances, and it will also be used for SSH access to these instances.

resource "tls_private_key" "rsa_4096" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

resource "aws_key_pair" "default" {
  key_name   = var.name
  public_key = tls_private_key.rsa_4096.public_key_openssh
}
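
Attaching the key pair to an instance is a single argument on aws_instance, referencing the key pair by name; this is presumably part of the configuration elided from the instance blocks below:

  key_name = aws_key_pair.default.key_name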

Nodes

Now that we have our network infrastructure and SSH key set up, let’s move on to creating our EC2 instances.

Selecting the AMI

First, we’ll get the AMI for Amazon Linux 2023:

nodes.tf
data "aws_ami" "base" {
  owners      = ["amazon"]
  most_recent = true
  name_regex  = "^al2023-ami-2023.*-.*-x86_64"
}

This data source looks up Amazon-owned images whose names match the Amazon Linux 2023 x86_64 pattern, and most_recent = true makes sure we always get the newest one.
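
If you want to confirm which image was actually selected, a small output (the name ami here is arbitrary) will print it after an apply:

output "ami" {
  value = {
    id   = data.aws_ami.base.id
    name = data.aws_ami.base.name
  }
}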

Defining the Instances

Now, let’s define the instances for our manager and worker nodes:

nodes.tf
...

resource "aws_instance" "manager" {
  ami                    = data.aws_ami.base.id
  instance_type          = var.manager_type
  subnet_id              = aws_subnet.public.id
  private_ip             = cidrhost(aws_subnet.public.cidr_block, count.index + 10) # skip the first few addresses, which AWS reserves in every subnet
  vpc_security_group_ids = [aws_security_group.manager_node.id]

  associate_public_ip_address = true

  tags = {
    Name = "${var.name}-manager-${count.index + 1}"
  }

  ...
}

data "aws_eip" "manager" {
  id = var.manager_eip_id
}

resource "aws_eip_association" "manager" {
  instance_id   = aws_instance.manager[0].id
  allocation_id = data.aws_eip.manager.id
}

resource "aws_instance" "worker" {
  ami                    = data.aws_ami.base.id
  instance_type          = var.worker_type
  subnet_id              = aws_subnet.private.id
  private_ip             = cidrhost(aws_subnet.private.cidr_block, count.index + 10) # again, skip the AWS-reserved addresses at the start of the subnet
  vpc_security_group_ids = [aws_security_group.worker_node.id]

  tags = {
    Name = "${var.name}-worker-${count.index + 1}"
  }

  ...
}

This configuration creates two nodes: one manager and one worker.

Key points:

  • The manager node is placed in the public subnet and assigned a public IP.
  • We’re associating a static IP (Elastic IP) with the manager node. If you don’t need a static IP, you can remove this association and keep associate_public_ip_address = true. Just be aware that the IP will change if the instance is restarted.
  • The worker node is placed in the private subnet and only has a private IP.

Scaling Up

To add more manager nodes, you can update the count parameter in the manager resource block. For example, to deploy two manager instances:

...

resource "aws_instance" "manager" {
  count = 2

  ...
}

Note: When scaling to multiple manager nodes, you’ll want a Network Load Balancer (NLB) to distribute traffic across them. In that case, you’d assign the Elastic IP to the NLB instead of directly to an EC2 instance. This setup hasn’t been tested in this configuration; you can follow the official AWS documentation for the details.
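
Untested here, as noted, but roughly the pieces would be an NLB in the public subnet holding the Elastic IP, plus a target group and listener per exposed port. A sketch for 443/TCP only (names are placeholders):

resource "aws_lb" "manager" {
  name               = "${var.name}-manager"
  load_balancer_type = "network"

  # The Elastic IP moves from the instance to the load balancer.
  subnet_mapping {
    subnet_id     = aws_subnet.public.id
    allocation_id = data.aws_eip.manager.id
  }
}

resource "aws_lb_target_group" "https" {
  name     = "${var.name}-https"
  port     = 443
  protocol = "TCP"
  vpc_id   = aws_vpc.default.id
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.manager.arn
  port              = 443
  protocol          = "TCP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.https.arn
  }
}

resource "aws_lb_target_group_attachment" "https" {
  count            = length(aws_instance.manager)
  target_group_arn = aws_lb_target_group.https.arn
  target_id        = aws_instance.manager[count.index].id
  port             = 443
}

If you go this route, the aws_eip_association above goes away, since the Elastic IP now lives on the load balancer, and you’d add a second target group and listener for port 80.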

Adding more workers

Similarly, to scale up the worker nodes, adjust the count parameter in the worker resource block. For example, to deploy three worker instances:

...

resource "aws_instance" "worker" {
  count = 3

  ...
}

This setup provides a flexible foundation for your Docker Swarm cluster, allowing you to easily scale the number of manager and worker nodes as your needs grow.
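
If you’d rather not hard-code these numbers, one option is to drive them from input variables (the names here are hypothetical) and set count = var.manager_count and count = var.worker_count in the respective resource blocks:

variable "manager_count" {
  type    = number
  default = 1
}

variable "worker_count" {
  type    = number
  default = 1
}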

What now?

Let’s recap for a second. We now have:

  • A manager node in the public subnet, accessible from the internet.
  • A worker node in the private subnet, not directly accessible from the internet.

How do I access my nodes?

Good question. Right now, you can’t: to SSH into the instances you need their IPs and the private key. Let’s update the code a bit to make that possible.

First, let’s save the private key needed for SSH:

key.tf
...

resource "local_file" "default" {
  content  = tls_private_key.default.private_key_pem
  filename = "id_rsa"
}

This block writes the private key to a file called id_rsa, with permissions locked down so SSH will accept it.

Next, let’s generate a hosts file with our instance information:

nodes.tf
...

resource "variable_file" "hosts" {
  filename = "hosts"
  content = templatefile("${path.module}/hosts.tmpl",
    {
      manager = aws_instance.manager
      worker  = aws_instance.worker
    }
  )
}

This block generates a file based on hosts.tmpl using data from manager and worker nodes.

Here’s what the hosts.tmpl file should look like:

hosts.tmpl
[manager]
%{ for instance in manager ~}
${instance.tags.Name} ${instance.public_ip}
%{ endfor ~}

[worker]
%{ for instance in worker ~}
${instance.tags.Name} ${instance.private_ip}
%{ endfor ~}

For now, we’re only using this hosts file to get the IPs. You might notice that we use public_ip for the manager but private_ip for the workers. Why? The manager has a public address (plus the Elastic IP we associated with it), but the workers were never given one, so their private IPs are the only addresses we have for them.
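
With one manager and one worker, the rendered hosts file ends up looking something like this (the name prefix and the public IP are placeholders):

[manager]
example-manager-1 203.0.113.10

[worker]
example-worker-1 10.0.20.10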

Connecting to Your Nodes

To SSH into manager nodes, run:

ssh -i id_rsa ec2-user@MANAGER_IP

But how do we connect to worker nodes? They’re inside a private subnet, inaccessible from the internet. The answer is SSH’s ProxyJump feature: we use the manager node as a bastion (jump) host to reach the workers.

Here’s how it works: the manager has a public IP address and is in the public subnet, which is in the same VPC as the private subnet where worker nodes are located. Because they’re in the same VPC, they can connect to each other.

To SSH into a worker node:

ssh -i id_rsa -o ProxyCommand="ssh -i id_rsa -p 22 -W %h:%p -q ec2-user@MANAGER_IP" ec2-user@WORKER_IP
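
If you do this often, an entry in ~/.ssh/config saves a lot of typing. Something like the following (the host aliases and the key path are placeholders) lets you run ssh worker-1 and have SSH do the jump for you; ProxyJump is the config-file equivalent of the ProxyCommand trick above:

Host manager-1
  HostName MANAGER_IP
  User ec2-user
  IdentityFile /path/to/id_rsa

Host worker-1
  HostName WORKER_IP
  User ec2-user
  IdentityFile /path/to/id_rsa
  ProxyJump manager-1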

And that wraps up our infrastructure setup for now. Want to see how it all fits together? The complete code is right here. Next time, we’ll dive into configuring these nodes with Ansible.

AWS
OpenTofu