Advanced Networking on AWS: VPC Design, Peering, and Transit Gateways

Networking is the foundation of any well-designed cloud infrastructure. But let's face it, networking on AWS can get complex quickly, especially when you're dealing with multiple VPCs, hybrid cloud setups, and strict security requirements. Once again, Simple AWS comes to the rescue!

In this article, we'll dive deep into AWS networking. We'll talk about best practices for VPC design, peering, and Transit Gateway architectures, and cover things like subnetting strategies, security considerations, multi-region setups, and hybrid cloud architectures. When you're done, I can't promise that you'll have a solid understanding of everything, because a solid understanding requires a lot of practice. What I can promise is that you'll have a pretty good idea of what's what, and you'll know what you don't know.

Got 30 hours to learn AWS Networking? Go grab this excellent course. Got 15 minutes? Read this article.

Designing Scalable and Secure VPC Architectures

Amazon Virtual Private Cloud (VPC) is the core networking service in AWS. VPCs allow you to launch AWS resources in a logically isolated virtual network that you define (basically, a private network). But designing a VPC that's both scalable and secure requires a bit of planning and consideration.

When designing your VPC architecture, one of the first things to consider is your subnetting strategy. Subnets allow you to partition your VPC into smaller, more manageable chunks, each with its own IP address range and network access controls. Believe it or not, a well-designed subnetting strategy can improve performance, security, and scalability, by isolating different types of workloads and controlling traffic between them.

Another key consideration in VPC design is high availability. By launching your resources across multiple Availability Zones (AZs), you can ensure that your application remains available even if an entire AZ goes down. When designing for high availability, it's important to consider things like load balancing, auto scaling, and failover mechanisms that will let you avoid having a single point of failure.

Security in VPC Design

Of course we'll talk about security! AWS gives you several tools and features to help you secure VPCs, the most basic being network access control lists (NACLs) and security groups.

NACLs act as a firewall for controlling traffic in and out of your subnets. They allow you to define inbound and outbound rules based on IP address ranges, protocols, and ports. They work at the subnet level, meaning the same rules apply to the subnet and everything in it. They're especially useful because you can allow everything and set a few deny rules, easily blocking some IP addresses (you can't set deny rules on security groups). They're stateless, meaning that you need a rule to allow traffic out, and a separate rule to allow the response back in (typically over the ephemeral ports from 32768 to 65535).
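For example, here's roughly what that default-allow-plus-denylist pattern looks like with the AWS CLI (the NACL ID and the blocked IP are placeholders, use your own):

# Allow all inbound traffic (rules are evaluated from the lowest number up)
aws ec2 create-network-acl-entry --network-acl-id acl-0123456789abcdef0 --ingress --rule-number 200 --protocol -1 --cidr-block 0.0.0.0/0 --rule-action allow

# Deny one specific IP address with a lower-numbered (higher-priority) rule
aws ec2 create-network-acl-entry --network-acl-id acl-0123456789abcdef0 --ingress --rule-number 100 --protocol -1 --cidr-block 203.0.113.7/32 --rule-action deny

# NACLs are stateless, so allow outbound responses on the ephemeral ports
aws ec2 create-network-acl-entry --network-acl-id acl-0123456789abcdef0 --egress --rule-number 200 --protocol tcp --port-range From=32768,To=65535 --cidr-block 0.0.0.0/0 --rule-action allow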

Security groups, on the other hand, act as a virtual firewall for your Elastic Network Interfaces (ENIs). They allow you to set inbound and outbound rules (allow only; anything that isn't explicitly allowed is denied). The fact that they're associated with ENIs means you can set different rules for different resources in the same subnet. In case you didn't know, ENIs are what give AWS resources an IP address inside a VPC: any resource with an IP address has at least one (EC2 instances come with a default one, and you can attach additional ones).

Security groups allow you to do some cool things like define rules based on other security groups or other high-level constructs like VPC Endpoints. Additionally, they're stateful, meaning if a request was allowed, its associated response is automatically allowed regardless of inbound rules. We already mentioned security groups in the article about securing the connection from EC2 to S3, where we used one to control the traffic that's allowed into a VPC Endpoint.
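As an example, a rule like this (with made-up group IDs) lets your database tier accept MySQL connections only from instances in the app tier's security group, no matter what IP addresses those instances have:

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 3306 --source-group sg-0fedcba9876543210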

Another important security feature in VPCs is VPC flow logs. Flow logs allow you to capture information about the IP traffic going to and from your VPC, including source and destination IP addresses, ports, and protocols. Like the name suggests, they're logs for the traffic that flows across a VPC.
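Enabling them is one command. Something like this should do it, assuming you already have a CloudWatch Logs group and an IAM role that lets the flow logs service write to it (names here are placeholders):

aws ec2 create-flow-log --resource-type VPC --resource-ids vpc-0123456789abcdef0 --traffic-type ALL --log-group-name my-vpc-flow-logs --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role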

VPC Architecture and Design Techniques

Yes, VPCs are also architected! Who would have guessed AWS wasn't actually simple? Well, I'll still do my best to simplify it.

Subnet Partitioning

One of the most important aspects of VPC design is subnet partitioning. Subnets are smaller networks within your VPC that allow you to organize and isolate your resources based on their security and operational requirements.

When partitioning your VPC into subnets, keep these things in mind. First, consider the different types of resources you'll be deploying in your VPC, and how they should be grouped together. For example, you may want to create separate subnets for your web servers, application servers, and database servers, in order to apply different security policies and access controls to each tier.

Another consideration is high availability and fault tolerance. Create subnets in multiple Availability Zones (AZs), so your application remains available even if an entire AZ goes down. Reminder: each AZ is made up of one or more physically separate data centers, with their own power, cooling, and network connectivity, giving you built-in redundancy and failover capabilities.
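In practice that just means creating a subnet per AZ, something like this (IDs and CIDRs are placeholders):

aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 --cidr-block 10.20.1.0/24 --availability-zone us-east-1a

aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 --cidr-block 10.20.2.0/24 --availability-zone us-east-1b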

When designing subnets, generic guidelines will tell you to "consider the size and number of subnets you'll need", and the answer is always "it depends" (or if you're a consultant, it's "it depends, that'll be $1000"). Read the following section first, to understand CIDRs. After that I'll give you some useful prescriptive advice.

CIDRs for VPCs and Subnets

Learning CIDRs sucks, but it's necessary. CIDR stands for Classless Inter-Domain Routing, and it's a notation to denote the blocks of IP addresses that VPCs and subnets (or networks in general, really) can use. The "classless" part comes from the fact that back in the stone age (when the internet was used by like 100 entities instead of the many millions we have today) networks were divided into Classes A, B and C.

The CIDR notation works like this:

  • IPv4 addresses consist of 4 numbers, each going from 0 to 255. For example, 192.168.0.1. That's the human-readable representation; to the network it's just 32 bits.

  • All resources in a network share the same prefix (the first few of those 32 bits), and the network is identified by that prefix. For example, 192.168.0.0 is a network, and the 2nd address inside it is 192.168.0.1.

  • Networks reserve two addresses: the first one is the network address, and the last one is used to broadcast to all devices in the network. AWS reserves 3 more for each subnet: the second one for the VPC router, the third one for the DNS server and the fourth one for future use.

  • Each device in the network gets an IP address that's not in use or not reserved. So, how many IP addresses your network has determines how many resources you can launch in that network.

  • CIDR is a compact way to express the network address and the size (i.e. how many IP addresses it has). You write the network address and how many bits the prefix uses: for example, 10.0.0.0/16 means the network prefix uses the first 16 bits, and 10.0.0.0/8 means it uses the first 8 bits.

  • A network has a number of IP addresses equal to 2^(32 - bits in the network address) - reserved addresses. For example, for a 10.0.0.0/24 VPC, you'll have 2^8 - 5 = 251 addresses, and for a 10.0.0.0/23 VPC you'll have 2^9 - 5 = 507 addresses.

  • CIDR can also be used for IPv6, though for IPv6 you don't need to plan networks, they're already ridiculously big.

So, the point is that you want to choose a network size that will allow you to deploy all the resources you need. A /24 network supports 251 resources, which sounds like a lot, but can get exhausted quickly as you scale (after the list I'll show you how to check how much room a subnet has left), since:

  • ALBs consume 5 addresses

  • NAT Gateways consume one per AZ

  • AWS PrivateLink consumes more than one

  • Virtual Private Gateways consume one as well

  • Each execution environment of a Lambda function in a VPC consumes one address

  • Each EC2 instance or ECS container obviously consumes one.
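And here's that promised check: AWS tracks remaining capacity for you, so one query tells you how close a subnet is to running out (subnet ID is a placeholder):

aws ec2 describe-subnets --subnet-ids subnet-0123456789abcdef0 --query "Subnets[].AvailableIpAddressCount"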

CIDR Best Practices

Alright, time for the prescriptive advice. The first rule of networking is that IPv6 has been around for over 25 years, so stop pretending it's new and start using it if you can. Especially since AWS now charges you for each public IPv4 address (not just Elastic IP addresses, all of them). IPv6 also removes all the hassle of IP address management!

Second rule: Don't use 10.0.0.0 for all your VPCs. You'll thank me the day you need to peer two VPCs that you weren't planning on peering. The 10.0.0.0/8 range gives you anything from 10.0 to 10.255 to play with, so get a bit creative!

Third rule: Small public subnets, big private subnets. Make your public subnets something like 10.20.0.0/26, which leaves you with 59 addresses, way more than enough to drop a load balancer and a NAT Gateway. Make private subnets bigger, usually /21, for 2043 addresses for EC2 instances, containers, Lambda functions, RDS instances, VPC Endpoints, and the tons of stuff you end up needing. Be consistent, please. Of course, deploy your things in private subnets, for security.
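Here's what that scheme might look like with the CLI (swap the 10.20 for whatever you picked in rule two; IDs are placeholders):

# VPC with a /16 from somewhere in 10.x that isn't 10.0
aws ec2 create-vpc --cidr-block 10.20.0.0/16

# Small public subnet: a /26 gives you 59 usable addresses
aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 --cidr-block 10.20.0.0/26 --availability-zone us-east-1a

# Big private subnet: a /21 gives you 2043 usable addresses
aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 --cidr-block 10.20.8.0/21 --availability-zone us-east-1a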

Public Subnets and Private Subnets

Let's dive a bit more into public and private subnets. Public subnets are subnets that are reachable from the public internet; private subnets are not.

Most VPCs will have an Internet Gateway (IGW), which lets your VPC access the internet (and the internet access your VPC). Each subnet's route table is where you determine whether that subnet can reach and be reached from the internet: if it contains a route for 0.0.0.0/0 that points to the Internet Gateway, the subnet is public; if there's no route to the IGW, it's a private subnet.
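So making a subnet public boils down to one route in its route table (the route table and gateway IDs here are placeholders):

aws ec2 create-route --route-table-id rtb-0123456789abcdef0 --destination-cidr-block 0.0.0.0/0 --gateway-id igw-0123456789abcdef0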

Public subnets are used for resources that need to be reached from the internet, like your public load balancers. Private subnets, on the other hand, are used for resources that don't need to be reached directly from the internet, which should be literally everything you can get away with. Resources in private subnets can still communicate with other subnets within the VPC, they just can't be reached directly from the internet. And with a NAT Gateway, they can access the internet without being accessible from it.
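Setting that up looks more or less like this: the NAT Gateway goes in a public subnet with an Elastic IP, and the private subnet's route table points its default route at it (all IDs are placeholders):

# Allocate an Elastic IP and create the NAT Gateway in a public subnet
aws ec2 allocate-address --domain vpc

aws ec2 create-nat-gateway --subnet-id subnet-0123456789abcdef0 --allocation-id eipalloc-0123456789abcdef0

# Point the private subnet's default route at the NAT Gateway
aws ec2 create-route --route-table-id rtb-0fedcba9876543210 --destination-cidr-block 0.0.0.0/0 --nat-gateway-id nat-0123456789abcdef0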

Technically, it's not enough to just place resources in a public subnet. They also need a public IP address, which is something you define upon resource creation (subnets can set a default, which you can override). But it's still a good practice to have private subnets, as a layer of defense against someone making a configuration error there.

Connecting VPCs with VPC Peering

For a single system, a single VPC is usually enough. However, there are many scenarios where you'll need to connect multiple VPCs together. That's where VPC peering comes in.

VPC peering allows you to create a direct network connection between two VPCs, enabling them to communicate with each other using private IP addresses. Traffic between peered VPCs stays on the AWS network, meaning nothing is exposed to the public internet or to anything outside the two VPCs.

VPC Peering is typically used when you need to connect two applications, more often than not across separate AWS accounts. VPC Peering also works across regions, though if you're building a single application across multiple regions, it's rare that you'll want instances talking to each other directly, mostly because of latency. Still, it's totally possible, so if you need to connect two systems in different regions, it will work.

Setting up a VPC peering connection is relatively straightforward, but there are a couple of things to keep in mind. First, you'll need to ensure that the CIDR blocks of the peered VPCs do not overlap. You'll also need to update your route tables to ensure that traffic is directed through the peering connection as intended. I'd give you a step by step tutorial, but I think the AWS docs do a good enough job.
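That said, the happy path is short. Roughly like this, with placeholder IDs and CIDRs:

# Request the peering connection (add --peer-owner-id and --peer-region for cross-account or cross-region peering)
aws ec2 create-vpc-peering-connection --vpc-id vpc-0123456789abcdef0 --peer-vpc-id vpc-0fedcba9876543210

# The owner of the peer VPC accepts the request
aws ec2 accept-vpc-peering-connection --vpc-peering-connection-id pcx-0123456789abcdef0

# Each VPC adds a route to the other VPC's CIDR through the peering connection
aws ec2 create-route --route-table-id rtb-0123456789abcdef0 --destination-cidr-block 10.30.0.0/16 --vpc-peering-connection-id pcx-0123456789abcdef0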

Once your VPC peering connection is set up, you can further secure it using network ACLs and security groups. By defining fine-grained inbound and outbound rules, you can tightly control which resources are allowed to communicate across the peering connection, and on which ports and protocols. The principle of least privilege that I always talk about also applies here: don't allow more origins, destinations, protocols or ports than the absolute minimum you need.

And by the way, you don't need port 22 to SSH. Just use Session Manager.

Simplifying Multi-VPC and Hybrid Cloud Networking with Transit Gateway

VPC Peering works well for a couple of VPCs, but it has a few limitations that prevent it from scaling:

  • VPC Peerings are not transitive. This means if you peer VPC A with VPC B and VPC B with VPC C, VPC A can't route traffic to VPC C. So a full mesh of 5 VPCs needs 10 peering connections. 10 VPCs? That's 45 peering connections (n × (n − 1) / 2, if you want the formula).

  • VPC Peerings only work if the VPC CIDR ranges don't overlap. This isn't a limitation if you plan for it. But if you're connecting two existing systems, I bet you'll run into this issue.

AWS Transit Gateway is the appropriate service to connect multiple VPCs at scale. It's a service that allows you to connect multiple VPCs and on-premises networks in a hub-and-spoke architecture. This simplifies network management and reduces the number of point-to-point connections required to connect all your resources, and the number of headaches.

One of the key benefits of Transit Gateway is that it acts as a central hub for routing traffic between VPCs and on-premises networks. This allows you to centrally manage and monitor network traffic, as well as to apply consistent security policies across all connected resources.

Setting Up and Configuring Transit Gateway

When creating a Transit Gateway, you'll need to specify the following configuration options:

  • Description: A simple description of your Transit Gateway for easy reference.

  • Amazon Side ASN: The private Autonomous System Number (ASN) for the Amazon side of a BGP session. I'm not even going to explain this, either use the default or watch this course.

  • DNS Support: Whether to enable DNS support for your Transit Gateway. This lets instances in attached VPCs resolve public DNS hostnames of instances in other attached VPCs to their private IP addresses.

  • VPN ECMP Support: Whether to enable Equal Cost Multipath (ECMP) routing support for VPN connections. This can help improve the resiliency and performance of your VPN connections.

  • Default Route Table Association: Whether to automatically associate Transit Gateway attachments with the default route table.

  • Default Route Table Propagation: Whether to automatically propagate routes from Transit Gateway attachments to the default route table.

  • Auto Accept Shared Attachments: Whether to automatically accept cross-account attachment requests.

Here's an example of what creating a Transit Gateway might look like with the AWS CLI:

aws ec2 create-transit-gateway --description "My Transit Gateway" --options AmazonSideAsn=64512,AutoAcceptSharedAttachments=enable,DefaultRouteTableAssociation=enable,DefaultRouteTablePropagation=enable,VpnEcmpSupport=enable,DnsSupport=enable --region us-east-1

Once the Transit Gateway is created, you'll need to attach your VPCs and VPNs. To attach a VPC, you need to create a Transit Gateway attachment for each VPC. This attachment specifies the VPC ID, the subnets to associate with the Transit Gateway, and the Transit Gateway ID.

Here's how to do that with the CLI:

aws ec2 create-transit-gateway-vpc-attachment --transit-gateway-id tgw-0123456789abcdef0 --vpc-id vpc-0123456789abcdef0 --subnet-ids subnet-0123456789abcdef0 --region us-east-1

To attach a VPN, you create a Site-to-Site VPN connection that targets your Transit Gateway and your Customer Gateway (CGW); AWS creates the Transit Gateway attachment for you. Here's the AWS CLI command:

aws ec2 create-vpn-connection --customer-gateway-id cgw-0123456789abcdef0 --type ipsec.1 --transit-gateway-id tgw-0123456789abcdef0 --options 'TunnelOptions=[{TunnelInsideCidr=169.254.100.0/30}]' --region us-east-1

Once your VPCs and VPNs are attached, the next step is to configure routing. Transit Gateway uses route tables to determine how traffic should be directed between attachments, pretty much like VPC Subnets use route tables to direct traffic. Each attachment is associated with a single route table, but a route table can be associated with multiple attachments.

By default, Transit Gateway creates a default route table that is automatically associated with all new attachments. However, you can create additional route tables for more granular routing policies. Like this:

aws ec2 create-transit-gateway-route-table --transit-gateway-id tgw-0123456789abcdef0 --region us-east-1

Once you have the route tables in place, you create routes to specify how traffic should be directed. Each route consists of a destination CIDR block and a target attachment ID. Again, example command with the AWS CLI:

aws ec2 create-transit-gateway-route --destination-cidr-block 10.0.0.0/16 --transit-gateway-route-table-id tgw-rtb-0123456789abcdef0 --transit-gateway-attachment-id tgw-attach-0123456789abcdef0 --region us-east-1

This creates a route that directs traffic destined for the 10.0.0.0/16 CIDR block to the Transit Gateway attachment with ID tgw-attach-0123456789abcdef0.

You can also do network segmentation and isolation. This means creating separate route tables for different environments or applications, which lets you control which attachments can communicate with each other and apply more granular security policies to each segment (remember Least Privilege? Yeah, we're still talking about that).

For example, say you have 2 VPCs, one for a production environment and one for development, and you want to make sure that traffic from the dev VPC can't reach the prod VPC. You could create a route table for prod and one for dev, and associate each VPC attachment with the appropriate route table. You could then create routes in each table to allow traffic within the environment but not between environments.
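Assuming you already created the two route tables, the association part would look something like this (IDs made up, as usual):

# Associate the prod VPC attachment with the prod route table
# (if Default Route Table Association is enabled, disassociate from the default table first)
aws ec2 associate-transit-gateway-route-table --transit-gateway-route-table-id tgw-rtb-0123456789abcdef0 --transit-gateway-attachment-id tgw-attach-0123456789abcdef0

# Associate the dev VPC attachment with the dev route table
aws ec2 associate-transit-gateway-route-table --transit-gateway-route-table-id tgw-rtb-0fedcba9876543210 --transit-gateway-attachment-id tgw-attach-0fedcba9876543210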

Why would you do this? Well, this way both environments can access a shared resource that's in a third VPC, like a shared security appliance, or a Jenkins installation. Well, maybe not a Jenkins installation. Let's act like we're in the 21st century and use something a bit more modern, like CodePipeline.

Hybrid Cloud Connectivity with Transit Gateway

In addition to connecting VPCs, Transit Gateway also makes it easier to connect your on-premises networks to your AWS environment. This can be done using either AWS Direct Connect or AWS Site-to-Site VPN.

Direct Connect allows you to establish a dedicated network connection from your on-premises data center to AWS, providing consistent, low-latency performance. It's literally a cable that goes (semi-)directly to an AWS datacenter. I'll leave it at that, since it's pretty complex on its own, and physical stuff isn't really my area of expertise.

Site-to-Site VPN, on the other hand, allows you to securely connect your on-premises network to AWS over the public internet using IPsec tunnels. We've discussed Client VPN in the past, which is designed to connect individual computers. Site-to-Site VPN follows the same idea, but provides a faster and more robust connection, designed to be always on, with datacenters in mind.
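The on-premises side of a Site-to-Site VPN is represented by a Customer Gateway, and creating one is a single command (the public IP and ASN here are made up; use your VPN device's actual values):

aws ec2 create-customer-gateway --type ipsec.1 --public-ip 203.0.113.12 --bgp-asn 65000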

Using either of these two methods you can connect your on-premises network to your AWS environment. Transit Gateway handles the network routing and network visibility part of that.

Another cool feature of Transit Gateway is route propagation, which automatically propagates routes from VPN and Direct Connect attachments to your Transit Gateway route tables. In other words, your on-premises networks' routes show up in the specified route table automatically, letting your AWS resources communicate with your on-premises resources without you configuring each route separately (and it works in the other direction too).
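Propagation is enabled per attachment and per route table, so something like this should do it (placeholder IDs again):

aws ec2 enable-transit-gateway-route-table-propagation --transit-gateway-route-table-id tgw-rtb-0123456789abcdef0 --transit-gateway-attachment-id tgw-attach-0123456789abcdef0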

Advanced Transit Gateway Scenarios

Beyond basic VPC and on-premises connectivity, Transit Gateway also supports some even more advanced scenarios. For example, you can share Transit Gateways across multiple AWS accounts using AWS Resource Access Manager (RAM), enabling you to connect VPCs and on-premises networks across different accounts (and, with Transit Gateway peering, even across regions).
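Sharing one through RAM looks roughly like this (the account IDs and the Transit Gateway ARN are placeholders):

aws ram create-resource-share --name shared-transit-gateway --resource-arns arn:aws:ec2:us-east-1:111122223333:transit-gateway/tgw-0123456789abcdef0 --principals 444455556666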

Another advanced use case for Transit Gateway is implementing multicast on AWS. Multicast is a method of sending data to multiple recipients simultaneously, useful for apps like video streaming and financial trading. By using a Transit Gateway with multicast support, you can easily deploy multicast applications on AWS without the need for complex workarounds. Not that this solution isn't complex, but trust me, the workarounds would be even more complex.

Last but not least, Transit Gateway integrates with Amazon CloudWatch and VPC Flow Logs for network monitoring and troubleshooting.

Securing AWS Networking with Network Firewalls and IDS/IPS

VPC security groups and NACLs are a strong foundation for securing your AWS network. However, there are some scenarios where you may need more advanced security measures. That's where network firewalls and intrusion detection/prevention systems (IDS/IPS) come into play.

AWS provides a managed network firewall service called AWS Network Firewall that allows you to deploy and manage stateful inspection firewalls across your VPCs. Network Firewall integrates with AWS Firewall Manager, enabling you to centrally manage and enforce firewall policies across multiple accounts and regions.

You can also deploy third-party IDS/IPS solutions in your VPC for deep packet inspection and threat detection capabilities. This can help you identify and block advanced threats like malware, data exfiltration attempts, and zero-day exploits.

I won't dive too deep into this, because I think it falls more into the realm of security than cloud architecture. But I wanted to throw the names around, as a starting point for you to research if this is something you need.

Optimizing Network Performance on AWS

Wait, optimization? I thought VPC was a managed service and you couldn't optimize! Well, you're not far off, but there are a few things you can fine tune to get better network performance.

Instance Size and Network Performance

One of the key factors that can impact network performance on EC2 is instance size. Each EC2 instance type has a different network performance capability, which is determined by the amount of network bandwidth and the number of network interfaces available.

For example, smaller instance types like t2.micro and t2.small have lower network performance compared to larger instance types like c5.large and m5.xlarge. This is because larger instance types have more network bandwidth and support more network interfaces, which allows them to handle more network traffic and achieve higher throughput.

Additionally, some instance types have unique network performance characteristics. For example, instances with enhanced networking (the ones with an "n" in the family name, like c5n.large) have significantly higher network performance compared to their non-"n" counterparts.
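You don't have to memorize these numbers, by the way. EC2 will tell you what each type is rated for:

aws ec2 describe-instance-types --instance-types t2.micro c5.large c5n.large --query "InstanceTypes[].[InstanceType,NetworkInfo.NetworkPerformance]" --output table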

Enhanced Networking and SR-IOV

Another way to optimize network performance on AWS is to enable enhanced networking features like SR-IOV (Single Root I/O Virtualization). SR-IOV is a hardware-based virtualization technology that allows a single physical network interface to be divided into multiple virtual interfaces, each with its own dedicated hardware resources.

SR-IOV allows EC2 instances to bypass the EC2 hypervisor and communicate directly with the physical network interface, reducing the overhead and latency associated with virtualization.

To take advantage of this on your EC2 instances, first choose an instance type that supports enhanced networking, such as the "c5n" or "m5n" instance types. These use the Elastic Network Adapter (ENA), which is already enabled on recent Amazon Linux AMIs, so usually there's nothing to install. On older instance types that use the Intel 82599 Virtual Function interface (like c3 or c4), you need to install the ixgbevf driver and enable the sriovNetSupport attribute yourself.

# Create a new security group that allows SSH access (note the group ID it returns)
aws ec2 create-security-group --group-name my-security-group --description "My security group" --vpc-id vpc-0123456789abcdef0

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 0.0.0.0/0

# Launch a c5n.large instance; ENA-based enhanced networking is enabled by default on recent Amazon Linux AMIs
aws ec2 run-instances --image-id ami-0947d2ba12ee1ff75 --instance-type c5n.large --key-name my-key-pair --security-group-ids sg-0123456789abcdef0 --subnet-id subnet-0123456789abcdef0 --associate-public-ip-address

# Verify that ENA support is enabled on the instance
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 --query "Reservations[].Instances[].EnaSupport"

# Connect to the instance and verify the ENA driver is in use
ssh ec2-user@<instance-public-ip>

ethtool -i eth0   # should report "driver: ena"

# On older 82599-based instance types you'd instead build the ixgbevf driver...
sudo yum install -y kernel-devel-$(uname -r) gcc make

# ...and enable the SR-IOV attribute (the instance must be stopped first)
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --sriov-net-support simple

Placement Groups and Network Performance

Another way to optimize network performance on AWS is to use placement groups. Placement groups are a way to logically group EC2 instances based on their network performance requirements and physical proximity.

There are three types of placement groups available on AWS:

  • Cluster placement groups: Instances in a cluster placement group are placed in a single Availability Zone and are connected to a low-latency, high-bandwidth network. This type of placement group is ideal for applications that require high network performance and low latency, such as high-performance computing (HPC) and big data workloads.

  • Partition placement groups: Instances in a partition placement group are placed in logical segments called partitions, with each partition in a separate rack with its own network and power source. This type of placement group is ideal for large distributed and replicated workloads, such as Hadoop and Cassandra clusters.

  • Spread placement groups: Instances in a spread placement group are placed on distinct underlying hardware (separate racks), and the group can span multiple Availability Zones. This type of placement group is ideal for applications where high availability is the priority, since a hardware failure (or a disaster that affects only part of an AZ) is very unlikely to take down all your EC2 instances at the same time.

The way to go about this is to first create a placement group and then select that placement group when you launch your EC2 instances.
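For example, with a cluster placement group (the group name is arbitrary, and the AMI ID is a placeholder):

aws ec2 create-placement-group --group-name my-cluster-pg --strategy cluster

aws ec2 run-instances --image-id ami-0947d2ba12ee1ff75 --instance-type c5n.large --count 2 --placement GroupName=my-cluster-pg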

AWS Global Accelerator and Network Performance

Another way to optimize network performance on AWS is to use AWS Global Accelerator. Global Accelerator is a networking service that improves the availability and performance of your applications by using the AWS global network to route traffic to the nearest regional endpoint.

With Global Accelerator, you can create accelerators that direct traffic to one or more endpoints in different AWS regions. Global Accelerator uses anycast IP addresses to route traffic to the nearest healthy endpoint, which helps to reduce latency and improve performance for users around the world.
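A minimal sketch of that setup (the ARN is a placeholder; note that the Global Accelerator API is served out of us-west-2, regardless of where your endpoints live):

aws globalaccelerator create-accelerator --name my-accelerator --ip-address-type IPV4 --region us-west-2

# Then add a listener (and after that, an endpoint group per region with create-endpoint-group)
aws globalaccelerator create-listener --accelerator-arn arn:aws:globalaccelerator::123456789012:accelerator/1234abcd-abcd-1234-abcd-1234abcdefgh --protocol TCP --port-ranges FromPort=443,ToPort=443 --region us-west-2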

Monitoring and Troubleshooting Network Performance

Don't optimize for network performance prematurely. First, use these tools to determine whether there's a problem:

  • Amazon CloudWatch: CloudWatch can monitor network performance metrics like network throughput, latency, and packet loss, and set alarms to notify you when performance degrades (example after the list).

  • VPC Flow Logs: They let you monitor network traffic at the interface, subnet, or VPC level, and gain insights into traffic patterns, security issues, and performance bottlenecks.
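Here's that alarm example: a hypothetical alarm that fires when an instance pushes sustained heavy outbound traffic (the instance ID and SNS topic are placeholders, and the threshold is in bytes per period):

aws cloudwatch put-metric-alarm --alarm-name high-network-out --namespace AWS/EC2 --metric-name NetworkOut --dimensions Name=InstanceId,Value=i-0123456789abcdef0 --statistic Average --period 300 --evaluation-periods 3 --threshold 500000000 --comparison-operator GreaterThanThreshold --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts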

Once you've determined that there's a network problem, only then start optimizing.

Conclusion

Advanced networking on AWS is a complex and multifaceted topic. Which is a fancy way of saying that was a lot. Still, I believe it's important that you have at least a high level idea of this stuff.

We've covered a wide range of things, from network architecture, subnetting strategies and security best practices to hybrid connectivity and performance optimization. We also talked about several AWS services like Transit Gateway and Amazon VPC. But as with everything in AWS, it's really complex, and constantly evolving.

The key takeaway here is that you don't need all of this stuff. Maybe you don't need any of this stuff! But if you ever do, now you know how this stuff works.
