The AWS VPC field manual: a layered guide

The VPC itself: a private slice of AWS

Everything that follows lives inside this one object. Think of the VPC as a fenced-off parcel of land you've rented in an AWS Region. Nothing leaks in or out of it unless you explicitly wire up a door.

A Virtual Private Cloud is a logically isolated network, scoped to a single AWS Region (e.g. us-east-1, eu-west-2). Inside it, you control the IP address space, routing, DNS, and which traffic is allowed in and out. Nothing outside it can reach your resources until you open a path.

Every AWS account gets one default VPC per region, pre-wired with public subnets in every Availability Zone. For anything serious you create your own; the default has sensible demo settings, not production ones.

A VPC does not cross regions. If you need resources in us-east-1 and eu-west-2 to talk, that's a second VPC and either VPC Peering, Transit Gateway, or a private backbone connection.

One VPC per environment, one environment per account. That's the modern default.

CIDR blocks: the address math

Before you launch anything, you decide how many IP addresses your network gets. CIDR notation is how you say it: 10.0.0.0/16 is both a starting address and a net mask in one breath.

CIDR (Classless Inter-Domain Routing) describes a range of IPs as base/prefix. The prefix — the number after the slash — tells you how many bits are fixed. Everything else is host addresses you can assign.

A /16 fixes the first 16 bits, leaving 16 bits free, for 65,536 addresses. A /24 fixes 24 bits, leaving 8, for 256 addresses. Smaller prefix number means a bigger network.

AWS accepts VPC CIDRs between /16 (65k IPs) and /28 (16 IPs). Use the private ranges defined in RFC 1918: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16. Don't pick a range outside these: it can collide with real internet addresses, and those destinations then become unreachable from inside your VPC.
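
The two rules above (prefix between /16 and /28, range inside RFC 1918) are easy to check before you create anything. A minimal sketch using Python's standard ipaddress module; the function name check_vpc_cidr is made up for illustration, and the RFC 1918 test encodes this guide's recommendation rather than a limit AWS enforces.

```python
import ipaddress

# The three RFC 1918 private ranges named in the text.
RFC1918 = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def check_vpc_cidr(cidr: str) -> list[str]:
    """Return the problems found with a proposed VPC CIDR (empty list = OK)."""
    try:
        # Strict parsing rejects host bits, e.g. 10.0.0.1/16.
        net = ipaddress.IPv4Network(cidr)
    except ValueError as exc:
        return [f"not a valid IPv4 CIDR: {exc}"]

    problems = []
    if not 16 <= net.prefixlen <= 28:
        problems.append(f"prefix /{net.prefixlen} is outside the /16 to /28 range")
    if not any(net.subnet_of(private) for private in RFC1918):
        problems.append(f"{net} is not inside an RFC 1918 private range")
    return problems


if __name__ == "__main__":
    for candidate in ("10.0.0.0/16", "10.0.0.0/8", "8.8.8.0/24"):
        print(candidate, check_vpc_cidr(candidate) or "ok")
```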

AWS also reserves 5 IPs per subnet for its own plumbing (network address, VPC router, DNS, future use, broadcast). A /28 subnet has 16 total, leaving 11 usable.
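
If you want to check the arithmetic yourself, here is a small sketch with Python's standard ipaddress module. The helper names are illustrative only; the five reserved slots are the ones listed above (the first four addresses and the last one).

```python
import ipaddress

# Network address, VPC router, DNS, future use, broadcast.
AWS_RESERVED_PER_SUBNET = 5


def subnet_capacity(cidr):
    """Return (total addresses, addresses you can actually assign)."""
    net = ipaddress.ip_network(cidr)
    total = net.num_addresses  # 2 ** (32 - prefix length)
    return total, total - AWS_RESERVED_PER_SUBNET


def reserved_addresses(cidr):
    """Return the five addresses AWS keeps for itself in a subnet."""
    net = ipaddress.ip_network(cidr)
    first_four = [net.network_address + offset for offset in range(4)]
    return first_four + [net.broadcast_address]


if __name__ == "__main__":
    for cidr in ("10.0.0.0/16", "10.0.1.0/24", "10.0.1.0/28"):
        total, usable = subnet_capacity(cidr)
        print(f"{cidr}: {total} total, {usable} usable")
    print([str(ip) for ip in reserved_addresses("10.0.1.0/24")])
```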

# A sensible allocation for one environment
VPC            10.0.0.0/16    # 65,536 IPs total
├─ public-a    10.0.1.0/24    # 256 (251 usable)
├─ public-b    10.0.2.0/24
├─ private-a   10.0.10.0/24
├─ private-b   10.0.11.0/24
├─ db-a        10.0.20.0/24
└─ db-b        10.0.21.0/24
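
A plan like this is worth sanity-checking before you build it: every subnet has to sit inside the VPC range, and no two subnets may overlap. A rough sketch of that check, using only the standard library; validate_plan and PLAN are names invented for this example.

```python
import ipaddress
from itertools import combinations

# The allocation shown above, as subnet name -> CIDR.
PLAN = {
    "public-a": "10.0.1.0/24",
    "public-b": "10.0.2.0/24",
    "private-a": "10.0.10.0/24",
    "private-b": "10.0.11.0/24",
    "db-a": "10.0.20.0/24",
    "db-b": "10.0.21.0/24",
}


def validate_plan(vpc_cidr, subnets):
    """Return a list of error strings; an empty list means the plan is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    errors = []
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            errors.append(f"{name} ({net}) is outside the VPC range {vpc}")
    for (name_a, net_a), (name_b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            errors.append(f"{name_a} ({net_a}) overlaps {name_b} ({net_b})")
    return errors


if __name__ == "__main__":
    print(validate_plan("10.0.0.0/16", PLAN) or "plan is consistent")
```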

Availability Zones: the physical layer

An AZ is an isolated data centre (or cluster of them) within a region. Separate power, separate cooling, separate network. If one AZ catches fire, the others keep serving traffic, provided you built for it.

Most regions have three or more AZs, identified as us-east-1a, us-east-1b, etc. Your VPC spans the whole region, but each subnet lives inside exactly one AZ.

The single most important architectural rule: for every workload that needs to survive a data-centre failure, put copies in at least two AZs. An Application Load Balancer does this automatically across the subnets you attach to it. RDS does it if you enable Multi-AZ. EC2 Auto Scaling does it if you list multiple subnets.
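
The two-AZ rule is simple enough to lint for. A minimal sketch, assuming you can export your subnets as (tier, AZ) pairs; the function name and the sample layout are hypothetical.

```python
from collections import defaultdict


def tiers_lacking_redundancy(subnets, min_azs=2):
    """Given (tier, az) pairs, return the tiers present in fewer than min_azs AZs."""
    azs_by_tier = defaultdict(set)
    for tier, az in subnets:
        azs_by_tier[tier].add(az)
    return sorted(tier for tier, azs in azs_by_tier.items() if len(azs) < min_azs)


if __name__ == "__main__":
    layout = [
        ("public", "us-east-1a"), ("public", "us-east-1b"),
        ("app", "us-east-1a"), ("app", "us-east-1b"),
        ("db", "us-east-1a"),  # only one AZ: this tier would not survive losing it
    ]
    print(tiers_lacking_redundancy(layout))
```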

AZs inside one region are connected by high-bandwidth, low-latency private fibre. Traffic between them is fast (single-digit ms) but not free — cross-AZ data transfer is billed.

Subnets: public and private

A subnet is a slice of your VPC's CIDR, locked to one AZ. The label “public” or “private” isn't a setting; it's a consequence of how you've wired its route table.

A subnet holds a chunk of IP addresses. Anything launched in it (EC2, RDS, Lambda ENI, EKS pod) is assigned one of those IPs. That's all a subnet mechanically is: an address pool attached to an AZ.

The public/private distinction comes from its route table (more on that below). If that route table has a default route to an Internet Gateway, the subnet is public. If it doesn't, it's private: resources inside can't be reached directly from the internet, and can't reach out without a NAT.
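
Put as code, the test is a one-liner over the route table. This sketch models a route table as a list of (destination, target) pairs and assumes IGW targets can be recognised by their igw- prefix; the gateway IDs are placeholders.

```python
def classify_subnet(routes):
    """Return 'public' if the default route targets an Internet Gateway, else 'private'."""
    for destination, target in routes:
        if destination == "0.0.0.0/0" and target.startswith("igw-"):
            return "public"
    return "private"


if __name__ == "__main__":
    via_igw = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "igw-0a1b2c")]
    via_nat = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "nat-09f8e7")]
    local_only = [("10.0.0.0/16", "local")]
    for routes in (via_igw, via_nat, local_only):
        print(classify_subnet(routes))
```

Note that a default route to a NAT Gateway still counts as private: the subnet can reach out, but nothing on the internet can reach in.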

A common production pattern uses three tiers per AZ: public (load balancers, bastions, NATs), private-app (your services), and private-db (RDS, ElastiCache). The app tier talks to the DB tier over a security-group rule, never across a public network.

Public subnet

Route to Internet Gateway
Resources get public IPs or Elastic IPs
Holds ALB, NAT Gateway, bastion
Never your databases

Private subnet

No route to IGW
Only private IPs
Holds app servers, databases
Egress via NAT Gateway

Internet Gateway: the front door

A VPC can have at most one IGW attached. It's the only path to or from the public internet for resources that have public IPs. It's not a physical thing: it's a managed, horizontally scaled, highly available piece of AWS plumbing.

The Internet Gateway does two jobs. Outbound: it performs 1:1 NAT, rewriting the instance's private source IP to its public IP. Inbound: it does the reverse, translating the public destination IP back to the private one and delivering the packet to the right ENI inside your VPC.

An IGW is not a firewall. It enforces no rules of its own — it just passes packets along if the route table, NACLs, and security groups all say yes. It also does not support bandwidth limits or logging on its own.

Important: a resource is only reachable through the IGW if it has a public IPv4 address or Elastic IP, AND it sits in a subnet whose route table points 0.0.0.0/0 at the IGW. Both conditions have to hold. Either missing means no internet.

One VPC, one Internet Gateway. Don't overthink it.

Route Tables: the rules of the road

If subnets are neighbourhoods, the route table is the street map pinned to each one's wall. It tells every packet leaving the subnet where to go next.

Every subnet has exactly one route table associated with it (the VPC's main route table, if you don't associate one explicitly). The table is just a list of entries: “for any traffic heading to CIDR X, forward it to target Y”. Targets can be an IGW, a NAT GW, a VPC Peering connection, a Transit Gateway, a VPC Endpoint, or local (the VPC's own CIDR, which is always present and can't be removed).

When a packet needs to leave, AWS scans the table for the most specific match and uses that. A 10.0.5.0/24 rule wins over a 10.0.0.0/16 rule wins over 0.0.0.0/0. The 0.0.0.0/0 entry is the default route, the fallback.
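
Longest-prefix matching is short enough to write out. The sketch below is an illustration, not how AWS implements it: the route table is a list of (CIDR, target) pairs, and the pcx- and tgw- targets are made-up IDs.

```python
import ipaddress


def next_hop(route_table, destination_ip):
    """Return the target of the most specific matching route, or None if nothing matches."""
    ip = ipaddress.ip_address(destination_ip)
    best = None
    for cidr, target in route_table:
        net = ipaddress.ip_network(cidr)
        # A longer prefix means a more specific route.
        if ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, target)
    return best[1] if best else None


ROUTES = [
    ("10.0.0.0/16", "local"),
    ("10.1.0.0/16", "pcx-1111"),   # hypothetical peering connection
    ("10.1.5.0/24", "tgw-2222"),   # hypothetical, more specific than the /16 above
    ("0.0.0.0/0", "igw-0a1b2c"),   # default route
]

if __name__ == "__main__":
    for ip in ("10.0.3.4", "10.1.2.3", "10.1.5.9", "8.8.8.8"):
        print(ip, "->", next_hop(ROUTES, ip))
```

The order of the rows doesn't matter; only the prefix length does.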

The only thing that makes a subnet “public” is the existence of a 0.0.0.0/0 → igw-xxxx row in its route table. Change that row and the subnet changes role.

# public route table, attached to public subnets
Destination        Target
10.0.0.0/16        local          # always there
0.0.0.0/0          igw-0a1b2c     # public-making row

# private route table
10.0.0.0/16        local
0.0.0.0/0          nat-09f8e7     # outbound only

# database route table (fully isolated)
10.0.0.0/16        local          # no egress at all

NAT Gateway and Elastic IPs: the outbound-only valve

Your private subnet's servers need to reach the internet to download packages, call third-party APIs, or hit S3 over its public endpoints. They can't have public IPs (that would make them public). The NAT Gateway is the way.

A NAT Gateway lives in a public subnet. Private instances route their default traffic to it; the NAT rewrites the source IP to its own (an Elastic IP), forwards the packet out through the IGW, remembers the connection, and routes the reply back to the originator. Traffic initiated from the internet can never reach private instances this way — the state table has no entry for it.
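
The state table is the whole trick, so here is a toy model of it. This is a simplified sketch, not AWS's implementation: the class name, the sequential port allocation, and the sample addresses are all assumptions made for the example.

```python
class ToyNat:
    """Toy source-NAT with a connection table keyed by the public-side port."""

    def __init__(self, elastic_ip):
        self.elastic_ip = elastic_ip
        self._next_port = 1024
        self._table = {}  # public port -> (private ip, private port, remote ip, remote port)

    def outbound(self, src_ip, src_port, dst_ip, dst_port):
        """Rewrite the source to the Elastic IP and remember the connection."""
        public_port = self._next_port
        self._next_port += 1
        self._table[public_port] = (src_ip, src_port, dst_ip, dst_port)
        return (self.elastic_ip, public_port, dst_ip, dst_port)

    def inbound(self, remote_ip, remote_port, public_port):
        """Map a reply back to the private originator; drop anything unsolicited."""
        entry = self._table.get(public_port)
        if entry is None or entry[2:] != (remote_ip, remote_port):
            return None  # no matching connection in the table
        return entry[0], entry[1]


if __name__ == "__main__":
    nat = ToyNat("198.51.100.10")
    packet = nat.outbound("10.0.10.5", 40000, "93.184.216.34", 443)
    print("sent as", packet)
    print("reply goes to", nat.inbound("93.184.216.34", 443, packet[1]))
    print("unsolicited:", nat.inbound("203.0.113.9", 443, packet[1]))
```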

An Elastic IP (EIP) is a static public IPv4 address owned by your account. You allocate it once and it stays yours until you release it. NAT Gateways require one. You can also attach EIPs to EC2 instances that need a fixed public address, one that survives a stop and start (an auto-assigned public IP does not).

Two gotchas worth internalising. First, NAT Gateways are per-AZ. If you put one in AZ-a and private subnets in AZ-b route to it, you pay cross-AZ data transfer and lose fault isolation: an outage in AZ-a takes AZ-b's egress down with it. Production pattern: one NAT per AZ, each private subnet routes to its local NAT.

Second, NAT Gateways cost money even when idle (roughly $32/month plus data processing). For small workloads a NAT Instance (a self-managed EC2 doing the same job) can be cheaper, though you lose the managed HA.

Network ACLs: the subnet bouncer

A NACL is a list of allow/deny rules evaluated at the subnet boundary. It's stateless, which is the one thing you must remember about it. Every packet is judged on its own merits, even the reply.

Rules are numbered, low to high, and the first match wins. Each direction (inbound and outbound) has its own list. At the end of both is an implicit * DENY ALL.

Because NACLs are stateless, if you allow inbound TCP on port 443 you also have to allow outbound on the ephemeral port range (1024–65535) so the response packet can leave. This trips people up constantly. Forget the return rule and the handshake silently dies.

The default NACL that AWS creates for you allows everything in and out. Most teams never touch it and let security groups do the work. Custom NACLs are for specific threats — blocking a bad IP range at the subnet edge, or meeting compliance requirements that ask for defence-in-depth at layer 4.

NACL  INBOUND
#90   DENY   all             203.0.113.7/32   # before the allows, or it never matches
#100  ALLOW  tcp/443         0.0.0.0/0
#110  ALLOW  tcp/80          0.0.0.0/0
#120  ALLOW  tcp/1024-65535  (ephemeral)
*     DENY   all             (implicit)

NACL  OUTBOUND
#100  ALLOW  tcp/443         0.0.0.0/0
#110  ALLOW  tcp/80          0.0.0.0/0
#120  ALLOW  tcp/1024-65535  (reply traffic!)
*     DENY   all
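
Both properties, first match wins and no memory of earlier packets, show up clearly in a few lines of code. A rough sketch, assuming each rule is a (number, action, protocol, port range, CIDR) tuple; the rule set is illustrative, with the deny placed at a lower number than the allows so that it is actually reached.

```python
import ipaddress


def evaluate_nacl(rules, protocol, port, peer_ip):
    """Return 'ALLOW' or 'DENY' for one packet; the lowest-numbered matching rule wins.

    peer_ip is the source for an inbound list and the destination for an outbound list.
    """
    peer = ipaddress.ip_address(peer_ip)
    for _number, action, proto, (low, high), cidr in sorted(rules, key=lambda r: r[0]):
        if proto in (protocol, "all") and low <= port <= high and peer in ipaddress.ip_network(cidr):
            return action
    return "DENY"  # the implicit final rule


INBOUND = [
    (90, "DENY", "all", (0, 65535), "203.0.113.7/32"),
    (100, "ALLOW", "tcp", (443, 443), "0.0.0.0/0"),
    (110, "ALLOW", "tcp", (80, 80), "0.0.0.0/0"),
    (120, "ALLOW", "tcp", (1024, 65535), "0.0.0.0/0"),
]

# An outbound list that forgot the ephemeral range.
OUTBOUND_NO_EPHEMERAL = [(100, "ALLOW", "tcp", (443, 443), "0.0.0.0/0")]

if __name__ == "__main__":
    print(evaluate_nacl(INBOUND, "tcp", 443, "198.51.100.1"))
    print(evaluate_nacl(INBOUND, "tcp", 443, "203.0.113.7"))
    # The request got in on 443, but the reply to the client's port 51000 cannot leave.
    print(evaluate_nacl(OUTBOUND_NO_EPHEMERAL, "tcp", 51000, "198.51.100.1"))
```

Nothing in evaluate_nacl knows that the outbound packet is a reply, which is exactly what stateless means.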

Security Groups: the instance firewall

Security groups wrap individual ENIs (so, effectively, instances). They're stateful: they remember every connection and auto-allow the reply. This one property is the reason most people never bother with NACLs.

Security groups are allow-only. There's no such thing as a deny rule. You list what's allowed in, what's allowed out, and everything else is denied. If a security group has no inbound rules at all, no inbound traffic reaches the instance. Period.

The best feature of security groups is that they can reference other security groups, not just IPs. Your database SG can say “allow port 5432 from sg-database-clients”. Attach that client SG to your app servers and they get access automatically, with no IPs to maintain. This is how you build clean tiered architectures.

A network interface can have up to 5 security groups attached (a default quota you can ask AWS to raise). The rules across them are evaluated as a union: if any SG allows the traffic, it's allowed.

# Classic three-tier layout

sg-alb    inbound  tcp/443   from  0.0.0.0/0
sg-app    inbound  tcp/8080  from  sg-alb
sg-db     inbound  tcp/5432  from  sg-app

# Only the ALB faces the world. The app only answers the ALB.
# The DB only answers the app. Clean chain of trust.
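
To see how SG references and the union rule interact, here is a toy evaluator for inbound rules. It is a sketch under simplifying assumptions: each rule is a (protocol, port, source) triple, a source starting with sg- is a group reference, and anything else is a CIDR. The function name is invented for the example.

```python
import ipaddress

# The three-tier layout above, as security group -> inbound rules.
RULES = {
    "sg-alb": [("tcp", 443, "0.0.0.0/0")],
    "sg-app": [("tcp", 8080, "sg-alb")],
    "sg-db": [("tcp", 5432, "sg-app")],
}


def sg_allows(rules, attached_sgs, protocol, port, source_ip, source_sgs=()):
    """True if any rule in any attached group allows the inbound connection."""
    for sg in attached_sgs:
        for proto, rule_port, source in rules.get(sg, []):
            if (proto, rule_port) != (protocol, port):
                continue
            if source.startswith("sg-"):
                # Group reference: matches when the sender carries that group.
                if source in source_sgs:
                    return True
            elif ipaddress.ip_address(source_ip) in ipaddress.ip_network(source):
                return True
    return False  # allow-only: no matching rule means denied


if __name__ == "__main__":
    # The ALB (carrying sg-alb) calls the app on 8080.
    print(sg_allows(RULES, ["sg-app"], "tcp", 8080, "10.0.1.20", ["sg-alb"]))
    # The ALB tries to skip a tier and call the database directly.
    print(sg_allows(RULES, ["sg-db"], "tcp", 5432, "10.0.1.20", ["sg-alb"]))
```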

Bastion host: the jump box

Your private instances have no public IPs. How do you SSH into them to debug something at 2am? You jump through a small, hardened host in the public subnet, the bastion.

A bastion host is a tiny EC2 instance in a public subnet whose only job is to accept your SSH connection and forward you on to a private instance. Keep it boring: minimal image, no extra services, SSH only, aggressive security group (only your office IP or corporate VPN range, ideally).

Because it's the single path in, it's also the single choke point to monitor. Every login, logged. Every key, rotated. Every session, auditable. That's the point: one door is easier to watch than many.

Modern alternative: AWS Systems Manager Session Manager. It lets you open a shell on any instance (even one with no public IP and no open SSH port) through IAM, without a bastion. Many teams now treat the bastion as legacy and use Session Manager for everything — less surface area, richer logs.

A bastion you never touch is better than a bastion you forget to patch.

Ingress and egress: tracing a full request

“Ingress” is traffic entering a resource; “egress” is traffic leaving. Here's the full path a single HTTPS request takes from a user's browser to a private EC2 instance, and back.

Browser sends HTTPS to the ALB's public DNS name.
Packet hits the Internet Gateway (allowed; the ALB is in a public subnet with a route).
Subnet NACL inbound checks: allow 443.
ALB's security group checks: allow 443 from 0.0.0.0/0.
ALB picks a healthy target (private EC2) and opens a new connection.
Private subnet NACL inbound: allow 8080 (the target's port; the ALB's source port is the ephemeral one).
EC2's security group: allow 8080 from sg-alb.
App processes the request, opens a DB connection to RDS.
RDS's security group: allow 5432 from sg-app.
Reply comes back. Security groups auto-allow the return (stateful); NACLs need the ephemeral rule (stateless).
ALB sends the HTTPS response back through the IGW to the user.

If the app needs to call a third-party API, that's the egress path: EC2 → private route table → 0.0.0.0/0 → NAT Gateway → IGW → internet. Security group egress rules on the EC2 must allow it. NACL outbound rules on the private subnet must allow it. NACL inbound on the same subnet must allow the reply's ephemeral port.

Beyond the basics: four more things worth knowing

VPC Endpoints

Talk to S3, DynamoDB, or other AWS services without your traffic ever touching the public internet. Two flavours: Gateway endpoints (free, S3 and DynamoDB only, added to your route table) and Interface endpoints (paid, most AWS services, backed by an ENI in your subnet).

Load balancers

ALB (layer 7): HTTP/HTTPS routing, path rules, host headers, OIDC. The default for web apps. NLB (layer 4): TCP/UDP, extreme performance, static IPs, for non-HTTP workloads. Both span multiple AZs automatically when you attach multi-AZ subnets.

VPC Peering

A direct link between two VPCs (even across accounts or regions). Traffic is private, not transitive (A↔B and B↔C doesn't give you A↔C), and requires route-table entries on both sides. Great for two VPCs, unwieldy for twenty.
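
Both points, no transitivity and poor scaling, fall out of a few lines of arithmetic. A small sketch with made-up VPC names: reachability needs a direct peering, and a full mesh of n VPCs needs n(n-1)/2 of them.

```python
def can_route(peerings, src, dst):
    """Peering is not transitive: only a direct peering between src and dst counts."""
    return src == dst or frozenset((src, dst)) in peerings


def full_mesh_peerings(vpc_count):
    """Number of peering connections needed for every VPC to reach every other."""
    return vpc_count * (vpc_count - 1) // 2


PEERINGS = {frozenset(("A", "B")), frozenset(("B", "C"))}

if __name__ == "__main__":
    print(can_route(PEERINGS, "A", "B"))
    print(can_route(PEERINGS, "A", "C"))  # B does not forward between its peers
    for n in (2, 5, 20):
        print(n, "VPCs need", full_mesh_peerings(n), "peerings")
```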

Transit Gateway

The grown-up version of Peering. A single regional hub that connects many VPCs, VPNs, and Direct Connect links, with centralised routing. Essential once you have more than five VPCs or multi-account setups.

The one-glance cheat sheet

VPC. Your private slice of one AWS region. Isolated by default.
CIDR. Address range as base/prefix. /16 = 65k IPs, /24 = 256. AWS reserves 5 per subnet.
AZ. Isolated data centre. Always build across at least two.
Subnet. CIDR slice locked to one AZ. Public/private determined by its route table.
Internet Gateway. One per VPC. The only path to and from the public internet.
Route Table. Says where traffic leaving a subnet should go. Most specific match wins.
NAT Gateway. Lets private instances call outbound. One per AZ for production.
Elastic IP. A static public IPv4. Required by NAT GW, optional for EC2.
NACL. Stateless subnet-level firewall. Remember to allow ephemeral ports.
Security Group. Stateful per-ENI firewall. Allow-only. Can reference other SGs.
Bastion. Hardened SSH jump box in public subnet. Often replaced by SSM.
Ingress / Egress. Into and out of a resource. Egress rules control what your app can call.
VPC Endpoint. Reach AWS services privately without an IGW or NAT.
Load Balancer. ALB for HTTP, NLB for TCP. Distribute traffic across AZs.
Peering / TGW. Peering = 2 VPCs, cheap. TGW = many VPCs, central hub.
Rule of thumb. If you're stuck, it's almost always a security group, NACL, or route-table issue. Check in that order.

Part two of this manual continues with the moving parts inside the network: serverless and CI/CD.