Serverless and CI/CD: a payments stack walkthrough

Does serverless even need a VPC?

It's the first question people ask after reading part one. The answer is: sometimes, and less often than you'd think.

Lambda runs in AWS-managed infrastructure by default, no VPC attached. That's ideal: cold starts are fastest, no ENI management, no NAT costs. You only attach a Lambda to a VPC when it needs to reach something private: an RDS instance, an ElastiCache cluster, an on-prem network over VPN, or an internal service in a private subnet.

DynamoDB, SQS, SNS, SES, S3 all live on the AWS network and are accessed over public endpoints by default. If your security posture requires private-only access, that's where VPC Endpoints come in: Gateway endpoints for S3 and DynamoDB (free), Interface endpoints for the rest (paid, one ENI per AZ).

CodeBuild similarly runs outside your VPC unless you attach it. You'd attach it when builds need to talk to a private artifact repo or run integration tests against private infrastructure.

Rule of thumb: start outside the VPC. Move in only when a dependency forces you.

CI/CD: from commit to running Lambda

Four AWS services chained together deliver your code with zero manual intervention. CodeCommit holds it, CodeBuild tests and bundles it, CodePipeline orchestrates everything, CodeDeploy ships it safely.

CodeCommit

CodeCommit is AWS's hosted Git: private repositories, branches, pull requests, triggers, all the usual Git surface. Access is tied to IAM identities (through the credential helper, HTTPS Git credentials, or SSH keys registered to an IAM user), which means repo access is managed with the same policies as every other AWS resource.

When to reach for it: you want code colocated with AWS IAM for unified access control, compliance requires code-at-rest encryption with your KMS keys, or you need native triggers to CodePipeline without webhooks.

Note: in 2024 AWS stopped onboarding new customers to CodeCommit. Many teams now use GitHub or GitLab with CodePipeline; the patterns are identical. Treat this section as “the source stage,” regardless of provider.

Key concepts: branches (merge to main triggers prod pipeline), approval rules (N reviewers required before merge), triggers (a push to a branch emits an EventBridge event, formerly CloudWatch Events, that starts the pipeline).

In a payments stack, each service typically lives in its own repo — payments-authorize, payments-capture, payments-refund, payments-webhooks. Pushes to main kick off that service's pipeline. Approval rules require one reviewer from the core team before any merge to main.

# configure once
git config --global credential.helper '!aws codecommit credential-helper $@'
git config --global credential.UseHttpPath true

# clone + push like any git repo
git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/payments-authorize
cd payments-authorize
git checkout -b feat/3ds-fallback
git commit -am "fall back to 3DS on issuer step-up"
git push origin feat/3ds-fallback

CodeBuild

CodeBuild is the “runs your build commands” service. Spin up a container, check out your code, run whatever buildspec.yml tells it to run, upload the result as an artifact, shut down. You pay per build minute — no idle cost.

Buildspec is a YAML file in your repo that defines the build phases: install (tooling), pre_build (setup, linting), build (compile, bundle), and post_build (run tests, emit metadata). A separate artifacts section lists what to hand to the next stage.

Worth knowing: it runs in a clean container each time (reproducible, no state leaks), supports custom Docker images for exotic toolchains, can attach to a VPC if tests need private resources, and supports caching dependencies in S3 or local cache to speed up cold builds.

The authorize Lambda bundle, for example, is built by installing production Python dependencies into a build directory, running unit tests against mocked gateway and fraud-service responses, running an integration suite against a LocalStack container, then zipping the source plus vendored deps into authorize-bundle.zip for deploy.

version: 0.2

phases:
  install:
    runtime-versions:
      python: 3.12
    commands:
      - pip install --upgrade pip
      - pip install -r requirements.txt
      - pip install -r requirements-dev.txt
  pre_build:
    commands:
      - ruff check .
      - pytest tests/unit --cov=src --cov-report=term-missing
  build:
    commands:
      - mkdir -p build
      - cp -r src/* build/
      - pip install -r requirements.txt -t build/
  post_build:
    commands:
      - pytest tests/integration
      - cd build && zip -r ../authorize-bundle.zip .

artifacts:
  files:
    - authorize-bundle.zip
    - appspec.yml

cache:
  paths:
    - '/root/.cache/pip/**/*'
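
The final packaging command (cd build && zip -r ../authorize-bundle.zip .) can also be expressed in Python, which is handy when the same bundling has to run locally or inside a test. A minimal sketch, assuming the build/ layout from the buildspec above; the function name bundle_directory is hypothetical and not part of any AWS tooling.

```python
import zipfile
from pathlib import Path


def bundle_directory(build_dir: str, bundle_path: str) -> list[str]:
    """Zip every file under build_dir using paths relative to it,
    mirroring `cd build && zip -r ../bundle.zip .`."""
    root = Path(build_dir)
    names = []
    with zipfile.ZipFile(bundle_path, "w", compression=zipfile.ZIP_DEFLATED) as bundle:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                arcname = path.relative_to(root).as_posix()
                bundle.write(path, arcname)
                names.append(arcname)
    return names
```

Write the bundle outside build_dir (as the buildspec does with ../), otherwise the archive would try to include itself.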

CodePipeline

CodePipeline is the orchestrator. It defines an ordered sequence of stages, each containing one or more actions, with artifacts flowing between them. It doesn't do the work itself — it invokes CodeCommit, CodeBuild, CodeDeploy, Lambda, CloudFormation, ECS, and around 40 other services.

The mental model: a pipeline is an ordered chain of stages that run one after another. A stage groups actions, which run in sequence or in parallel depending on their run order. An action is one unit of work (a CodeBuild run, a CodeDeploy deployment, a manual approval step). Artifacts are the inputs and outputs of actions, typically zip files stored in an S3 artifact bucket.

A common stage layout:

Source. Fetch from CodeCommit or GitHub.
Build. CodeBuild compiles, tests, bundles.
DeployDev. CodeDeploy ships to a dev alias, run smoke tests.
Approval. Manual gate for prod.
DeployProd. CodeDeploy shifts prod traffic gradually.

For a payments service, any push to main of payments-authorize runs through a 5-stage pipeline. Dev deploy is automatic. Prod deploy requires an on-call engineer to click “approve” — which prevents 3 AM accidents when nobody's watching the metrics.
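
The stage, action, and approval-gate model can be sketched in a few lines of plain Python. This is only a toy illustration of the control flow, not how CodePipeline is implemented; Stage, run_pipeline, and the placeholder artifact values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Stage:
    name: str
    actions: list[Callable[[dict], None]] = field(default_factory=list)
    manual_approval: bool = False


def run_pipeline(stages: list[Stage], artifacts: dict, approved: bool) -> list[str]:
    """Run stages in order; stop at an unapproved manual gate."""
    completed = []
    for stage in stages:
        if stage.manual_approval and not approved:
            break  # the pipeline waits here until someone approves
        for action in stage.actions:
            action(artifacts)  # an exception here fails the stage and stops the run
        completed.append(stage.name)
    return completed


# The five-stage layout from the text, with placeholder actions.
stages = [
    Stage("Source", [lambda a: a.update(source="commit-sha")]),
    Stage("Build", [lambda a: a.update(bundle="authorize-bundle.zip")]),
    Stage("DeployDev", [lambda a: a.update(dev="deployed")]),
    Stage("Approval", manual_approval=True),
    Stage("DeployProd", [lambda a: a.update(prod="deployed")]),
]
```

Without approval the run stops after DeployDev; with approval all five stages complete and the artifacts dict carries each stage's output forward.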

CodeDeploy

CodeDeploy is the last mile. It takes a built artifact and rolls it out to targets (EC2, ECS, on-prem servers, and Lambda) using strategies that limit blast radius. For Lambda specifically, CodeDeploy shifts an alias's traffic from one published version to another rather than replacing the function in place.

Lambda deployment strategies:

All-at-once. 100% traffic to new version instantly. Fast, risky.
Linear. e.g. 10% every 2 minutes over 20 minutes.
Canary. e.g. 10% for 5 minutes, then 100%. Smoke-test in prod.

CodeDeploy watches CloudWatch alarms during the shift. If an alarm fires (error rate, latency, custom metric), it automatically rolls back to the previous version.
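
To make the difference between linear and canary concrete, the traffic split over time can be written out as a schedule. A small sketch, assuming the first increment is applied at minute 0 (under that assumption a 10%-every-2-minutes shift reaches 100% at minute 18; if the first step were delayed by one interval it would be minute 20). The function names are hypothetical.

```python
def linear_schedule(step_percent: int, interval_minutes: int) -> list[tuple[int, int]]:
    """(minute, percent of traffic on the new version) for a linear shift."""
    schedule = []
    percent, minute = step_percent, 0
    while percent < 100:
        schedule.append((minute, percent))
        percent += step_percent
        minute += interval_minutes
    schedule.append((minute, 100))
    return schedule


def canary_schedule(canary_percent: int, bake_minutes: int) -> list[tuple[int, int]]:
    """A canary is two steps: a small slice, then everything."""
    return [(0, canary_percent), (bake_minutes, 100)]
```

A rollback at any point simply means the alias goes back to 0% on the new version, which is why the early, low-percentage steps are where alarms are most valuable.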

The appspec.yml for Lambda points at the new version and optionally defines hooks — Lambdas that run before traffic shifts (BeforeAllowTraffic) to sanity-check the new version, and after (AfterAllowTraffic) to verify the deploy succeeded.

version: 0.0
Resources:
  - authorizeFunction:
      Type: AWS::Lambda::Function
      Properties:
        Name: payments-authorize
        Alias: live
        CurrentVersion: 42
        TargetVersion: 43

Hooks:
  - BeforeAllowTraffic: payments-authorize-preflight
  - AfterAllowTraffic:  payments-authorize-smoketest

Authorize updates roll out canary-style: 10% of payment traffic hits the new version for 5 minutes. A BeforeAllowTraffic hook fires a synthetic test transaction against the new version before any real traffic shifts. If that test fails, the deployment stops with no customer impact. If the real-traffic DeclineRate alarm trips during the canary window, CodeDeploy points the alias back at the previous version, so at most the 10% canary slice sees broken checkouts.
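
A hook Lambda has one job beyond running its check: it must report the result back to CodeDeploy, otherwise the deployment waits until it times out. The sketch below shows that shape. The CodeDeploy client is injected so the logic can be exercised without AWS; in a real function you would pass boto3.client("codedeploy"). The names make_preflight_handler and run_synthetic_transaction are hypothetical.

```python
def make_preflight_handler(codedeploy, run_synthetic_transaction):
    """Build a BeforeAllowTraffic hook handler.

    codedeploy: an object exposing put_lifecycle_event_hook_execution_status
                (for example boto3.client("codedeploy")).
    run_synthetic_transaction: callable returning True when the new version behaves.
    """
    def handler(event, context):
        try:
            ok = bool(run_synthetic_transaction())
        except Exception:
            ok = False  # any crash in the check counts as a failed preflight

        status = "Succeeded" if ok else "Failed"
        # CodeDeploy passes both identifiers in the hook event.
        codedeploy.put_lifecycle_event_hook_execution_status(
            deploymentId=event["DeploymentId"],
            lifecycleEventHookExecutionId=event["LifecycleEventHookExecutionId"],
            status=status,
        )
        return {"status": status}

    return handler
```

Reporting "Failed" stops the deployment before any production traffic reaches the new version.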

Compute: Lambda and Step Functions

Lambda is your unit of execution. Step Functions is how you string multiple Lambdas into reliable, retryable workflows — the difference between a function and a business process.

Lambda

Lambda runs your code in response to events. You upload a function (a zip, a container image, or inline code); AWS handles the runtime, scaling, patching, and failure recovery. You're billed per request and per millisecond of execution time, with the duration rate scaled by the memory you configure, at sub-cent rates. No invocation, no cost.

Invocation models

Synchronous. API Gateway, ALB, another Lambda: caller waits for the response.
Asynchronous. SNS, S3 events, EventBridge: fire-and-forget, Lambda retries on failure.
Poll-based. SQS, DynamoDB Streams, Kinesis: the Lambda service polls and invokes.

Things that bite people

Cold starts. First invocation after idle takes longer (100ms–2s depending on package size, VPC attachment).
15-minute timeout. Hard ceiling. For longer work, use Step Functions.
Stateless. Don't rely on /tmp persisting or on a specific container instance.
Concurrency limits. Default 1000 per account per region. Unhandled bursts throttle.
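
Whether a burst will hit the concurrency limit is simple arithmetic: concurrent executions are roughly the request rate multiplied by how long each request holds an instance. A back-of-the-envelope sketch, with hypothetical function names and the default limit of 1000 from the list above:

```python
def required_concurrency(requests_per_second: int, avg_duration_ms: int) -> int:
    """Concurrent executions needed: arrival rate x time each request is in flight."""
    in_flight_ms = requests_per_second * avg_duration_ms
    return -(-in_flight_ms // 1000)  # ceiling division without floats


def would_throttle(requests_per_second: int, avg_duration_ms: int, limit: int = 1000) -> bool:
    """True when steady-state demand exceeds the concurrency limit."""
    return required_concurrency(requests_per_second, avg_duration_ms) > limit
```

For example, 500 requests per second at 200 ms each needs about 100 concurrent executions, while 5,000 requests per second at 300 ms needs about 1,500 and would throttle under the default limit.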

payments-authorize is a Lambda triggered by API Gateway for charge requests. Given a payment intent ID, it loads the customer's saved payment method from DynamoDB, runs a fraud check via another Lambda, calls the upstream gateway for authorization, and returns the auth result — all in under 200ms at the 99th percentile.

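import json
import boto3

# client hoisted OUT of handler, reused across warm invocations
ddb = boto3.client('dynamodb', region_name='us-east-1')

# run_fraud_check and authorize_with_gateway are assumed to be defined
# elsewhere in this module; they are not shown here.

def handler(event, context):
    # API Gateway proxy events carry the request body as a JSON string (or None)
    try:
        body = json.loads(event.get('body') or '{}')
        payment_intent_id = body['paymentIntentId']
    except (json.JSONDecodeError, KeyError, TypeError):
        return {'statusCode': 400, 'body': 'missing or malformed paymentIntentId'}

    response = ddb.get_item(
        TableName='payment-intents',
        Key={'paymentIntentId': {'S': payment_intent_id}},
    )
    item = response.get('Item')

    if not item:
        return {'statusCode': 404, 'body': 'unknown payment intent'}

    # a blocking fraud score short-circuits before the gateway is called
    fraud_score = run_fraud_check(item)
    if fraud_score['block']:
        return {'statusCode': 402, 'body': 'declined'}

    auth = authorize_with_gateway(item)
    return {'statusCode': 200, 'body': json.dumps(auth)}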

Key settings worth tuning: memory at 512 MB (the default is 128 MB, and CPU scales with memory), timeout at 10s (the default 3s is often too short), runtime python3.12, architecture arm64 (priced roughly 20% lower per GB-second than x86_64), and reserved concurrency to cap downstream blast radius.

Step Functions

Step Functions is a state machine that calls other services in a defined order, with native support for retries, error handling, parallel branches, conditionals, and wait states. You describe the workflow in Amazon States Language (ASL, JSON); AWS runs it durably: execution state is persisted between steps, so if a Lambda crashes, the failed task can be retried or caught without losing the progress made by earlier states.

Without Step Functions you end up writing the same ugly code in every Lambda: try/catch, retry with backoff, store progress somewhere, handle partial failures. Step Functions removes all of that. The workflow is the retry logic. It's how you build processes that cross 15-minute Lambda limits or span hours.

Two flavours:

Standard. Up to 1 year, exactly-once, full history, ~$0.025 per 1k transitions. Use for business workflows.
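Express. Up to 5 min, at-least-once (asynchronous) or at-most-once (synchronous), higher throughput, billed per request and duration rather than per transition, which is usually far cheaper at volume. Use for event processing.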

A checkout is a multi-step process spanning seconds to minutes: validate the cart, run a fraud check, authorize the card with the upstream gateway, write a ledger entry, send the receipt. Each step is a Lambda. Step Functions orchestrates the chain with retries (the gateway is occasionally flaky), compensations (void the auth if the ledger write fails), and a catch-all that alerts ops on unrecoverable errors.

{
  "Comment": "Checkout authorization workflow",
  "StartAt": "ValidateIntent",
  "States": {

    "ValidateIntent": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:validate-intent",
      "Next": "FraudCheck",
      "Catch": [{
        "ErrorEquals": ["UnknownIntent"],
        "Next": "Fail"
      }]
    },

    "FraudCheck": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:fraud-check",
      "Retry": [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2.0
      }],
      "Next": "AuthorizeAndRecord"
    },

    "AuthorizeAndRecord": {
      "Type": "Parallel",
      "Branches": [
        { "StartAt": "AuthorizeWithGateway", ... },
        { "StartAt": "WriteLedgerEntry",     ... }
      ],
      "Next": "SendReceipt"
    },

    "SendReceipt": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ses:sendEmail",
      "End": true
    },

    "Fail": { "Type": "Fail" }
  }
}
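
The Retry block on FraudCheck is shorthand for a loop you would otherwise hand-write. In ASL, MaxAttempts counts retries (not total attempts), the first wait is IntervalSeconds, and each later wait is multiplied by BackoffRate. A sketch of the equivalent logic in plain Python, with hypothetical helper names:

```python
import time


def retry_delays(interval_seconds: float, max_attempts: int, backoff_rate: float) -> list[float]:
    """Waits before each retry: interval, interval*rate, interval*rate^2, ..."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


def call_with_retry(task, delays, sleep=time.sleep):
    """Run task once, then retry after each delay; the last failure propagates."""
    for delay in delays:
        try:
            return task()
        except Exception:
            sleep(delay)
    return task()  # final attempt; an exception here reaches the caller (the Catch path)
```

With IntervalSeconds 2, MaxAttempts 3, and BackoffRate 2.0, the waits are 2, 4, and 8 seconds, for four attempts in total before the error is surfaced.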

Data: DynamoDB and S3

DynamoDB holds the hot operational state your Lambdas read on every request. S3 holds everything else — artifacts, archives, exports, logs, anything that's large or cold.

DynamoDB

DynamoDB is a fully-managed NoSQL key-value and document database. Tables are indexed by a primary key, either a partition key alone or a partition key plus a sort key, and scale horizontally by partitioning on that key. Read/write latency is measured in single-digit milliseconds, regardless of whether the table holds 1,000 items or 100 billion.

Model your access patterns first, then design the schema.

Unlike SQL, you cannot JOIN your way out of a bad schema. Every query needs to hit a partition key or a secondary index. The best teams enumerate every read/write pattern on a whiteboard before touching CreateTable.

Essential concepts:

Partition key (pk). Determines physical partitioning — queries must specify it.
Sort key (sk). Optional; enables range queries within a partition.
GSI (Global Secondary Index). Different pk/sk to query the same data another way.
Streams. Every item change fires a record; Lambda can consume them in real time.
On-demand vs provisioned. Pay per request, or pay for reserved throughput.

The transactions table uses transactionId as the partition key, so every authorization does exactly one GetItem. A GSI on customerId handles dashboard lookups by user. A small Lambda consumes the table's DynamoDB Stream and publishes every status change to SNS, which fans it out to the ledger Lambda, so the ledger entry is written within seconds of the charge, not at the end of a nightly batch.

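import boto3
from botocore.exceptions import ClientError

ddb = boto3.client('dynamodb', region_name='us-east-1')

# Reading a transaction by id
response = ddb.get_item(
    TableName='transactions',
    Key={'transactionId': {'S': 'txn_01HX9V8ZQK...'}},
    ConsistentRead=True,   # capture-after-auth needs strong reads
)
item = response.get('Item')

# Writing with a conditional (optimistic concurrency)
try:
    ddb.update_item(
        TableName='transactions',
        Key={'transactionId': {'S': 'txn_01HX9V8ZQK...'}},
        UpdateExpression='SET #s = :new, version = version + :inc',
        ConditionExpression='version = :expected',   # no race
        ExpressionAttributeNames={'#s': 'status'},   # status is a reserved word
        ExpressionAttributeValues={
            ':new':      {'S': 'CAPTURED'},
            ':inc':      {'N': '1'},
            ':expected': {'N': '7'},   # the version seen on the earlier read
        },
    )
except ClientError as err:
    # A failed condition means another writer bumped the version first;
    # re-read and retry, or surface a conflict. Anything else is re-raised.
    if err.response['Error']['Code'] != 'ConditionalCheckFailedException':
        raise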
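
What the ConditionExpression buys you is easiest to see with two writers racing on the same item. The sketch below models the check in memory with a plain dict standing in for the table; it is an illustration of the idea, not DynamoDB's implementation, and every name in it is hypothetical.

```python
class VersionConflict(Exception):
    """Raised when the stored version no longer matches what the writer read."""


def conditional_update(table: dict, key: str, new_status: str, expected_version: int) -> dict:
    """Apply the update only if nobody else has written since we read."""
    item = table[key]
    if item["version"] != expected_version:
        raise VersionConflict(f"expected v{expected_version}, found v{item['version']}")
    item["status"] = new_status
    item["version"] += 1
    return item


table = {"txn_1": {"status": "AUTHORIZED", "version": 7}}

# Two writers both read version 7, then try to write.
first = conditional_update(table, "txn_1", "CAPTURED", expected_version=7)
try:
    conditional_update(table, "txn_1", "REFUNDED", expected_version=7)
    second_succeeded = True
except VersionConflict:
    second_succeeded = False  # the loser must re-read before retrying
```

The first write wins and bumps the version to 8; the second is rejected instead of silently overwriting CAPTURED.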

S3

S3 stores objects (any blob of bytes up to 5 TB) in buckets. Objects have keys (like file paths) and metadata, and a bucket can keep multiple versions of each object when versioning is enabled. Durability is designed for 11 nines (99.999999999%): on average, an expected loss of one object in 100 billion per year.

Storage classes

Standard. Millisecond access, the default.
Infrequent Access. Cheaper storage, per-GB retrieval fee.
Glacier Flexible Retrieval. Archive, minutes-to-hours to retrieve.
Glacier Deep Archive. Cheapest, 12–48h retrieval, for regulatory retention.
Intelligent-Tiering. S3 moves objects between tiers automatically based on access.

S3 as an event source

Uploading an object can trigger a Lambda (directly, or via SNS/SQS/EventBridge). This is the cornerstone of serverless ETL: drop a file, transformation happens. PUT → Lambda → transformed object → PUT to another bucket.

Settlement files generated by the gateway are written hourly to s3://payments-settlements/raw/YYYY/MM/DD/HH/ as Parquet files. An S3 PUT event triggers a Lambda that enriches each row with order data from DynamoDB and republishes to s3://payments-settlements/enriched/. Objects older than 90 days transition to Glacier Deep Archive — a PCI-DSS audit requirement keeps them for 7 years.
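
The hourly key layout is worth centralising in one helper so the writer and the enrichment Lambda agree on it. A minimal sketch, assuming the partitions are in UTC and that enriched objects mirror the raw path under enriched/ (neither is stated above); raw_key and enriched_key are hypothetical names.

```python
from datetime import datetime, timezone


def raw_key(generated_at: datetime, filename: str) -> str:
    """Build raw/YYYY/MM/DD/HH/<filename> from a timezone-aware timestamp."""
    ts = generated_at.astimezone(timezone.utc)
    return f"raw/{ts:%Y/%m/%d/%H}/{filename}"


def enriched_key(raw: str) -> str:
    """Map a raw/ key to its enriched/ counterpart, rejecting anything else."""
    if not raw.startswith("raw/"):
        raise ValueError(f"not a raw settlement key: {raw}")
    return "enriched/" + raw[len("raw/"):]
```

Rejecting keys outside raw/ also guards against the Lambda reprocessing its own output if the event notification is ever configured without a prefix filter.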

S3 also plays a CI/CD role: it's where CodePipeline stashes build artifacts (authorize-bundle.zip) between stages. It's where CodeDeploy pulls new Lambda code from. It's where CloudFormation templates live. It's the glue of the AWS world.

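// Writing an object with content-type + metadata (AWS SDK for JavaScript v3, ES module)
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({});  // region and credentials come from the environment

// parquetBuffer is assumed to be a Buffer produced earlier by the exporter
await s3.send(new PutObjectCommand({
  Bucket: 'payments-settlements',
  Key: `raw/2026/04/20/11/gateway-${Date.now()}.parquet`,
  Body: parquetBuffer,
  ContentType: 'application/vnd.apache.parquet',
  Metadata: { 'source': 'gateway', 'record-count': '14821' },
  ServerSideEncryption: 'aws:kms',
  SSEKMSKeyId: 'alias/payments-data'
}));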

// Lifecycle rule: archive after 90d, delete after 7y
{
  "Rules": [{
    "ID": "settlement-retention",
    "Status": "Enabled",
    "Prefix": "enriched/",
    "Transitions": [{
      "Days": 90,
      "StorageClass": "DEEP_ARCHIVE"
    }],
    "Expiration": { "Days": 2555 }
  }]
}

Messaging: SQS, SNS, and SES

Decoupling is the first principle of scalable systems. SQS buffers work. SNS fans events out. SES delivers customer communication. All three are fully managed and pay-per-use.

SQS

SQS is a managed message queue. A producer pushes messages; a consumer (typically a Lambda) receives them and acknowledges each one by deleting it. Between receive and delete, the message is invisible to other consumers for the duration of the visibility timeout. If the consumer crashes without deleting it, the message becomes visible again once the timeout expires and another consumer picks it up.

Two queue types:

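Standard. At-least-once delivery, best-effort ordering, occasional duplicates, nearly unlimited throughput.
FIFO. Exactly-once processing, strictly ordered within a message group, up to 300 API calls/s per queue by default (3,000 messages/s with batches of 10); high-throughput mode raises that.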

Queues decouple the producer from the consumer. If the consumer is slow or down, the queue absorbs the load without back-pressuring the producer. If the consumer scales faster than the producer, no harm done. Adding a queue is the single cheapest thing you can do to make a system resilient.

Dead-letter queues (DLQs): after N failed processing attempts (you pick N), SQS can move a message to a separate DLQ. This is how you catch “poison pills” — malformed messages that crash your handler. Always configure a DLQ, and always alarm on DLQ depth above zero.
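
The interplay between the visibility timeout and the DLQ's receive count is easier to reason about with a toy model. The class below is an in-memory stand-in written for illustration only; it is not the SQS API, and TinyQueue and its methods are hypothetical. Time is passed in explicitly so the behaviour is deterministic.

```python
class TinyQueue:
    """In-memory stand-in for one SQS queue with a redrive policy."""

    def __init__(self, visibility_timeout: int, max_receive_count: int):
        self.visibility_timeout = visibility_timeout
        self.max_receive_count = max_receive_count
        self.messages = {}  # message id -> {"body", "visible_at", "receives"}
        self.dlq = []
        self._next_id = 0

    def send(self, body, now: int = 0) -> int:
        self._next_id += 1
        self.messages[self._next_id] = {"body": body, "visible_at": now, "receives": 0}
        return self._next_id

    def receive(self, now: int):
        for message_id, message in list(self.messages.items()):
            if message["visible_at"] > now:
                continue  # still in flight with another consumer
            if message["receives"] >= self.max_receive_count:
                self.dlq.append(message["body"])  # too many failed attempts
                del self.messages[message_id]
                continue
            message["receives"] += 1
            message["visible_at"] = now + self.visibility_timeout
            return message_id, message["body"]
        return None

    def delete(self, message_id: int) -> None:
        self.messages.pop(message_id, None)  # the "ack"
```

A message that is received three times without being deleted (with max_receive_count set to 3) lands in the DLQ on the next poll instead of being handed out a fourth time.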

When a batch of 50,000 refunds is imported from a vendor reconciliation file, each row becomes a message on refund-jobs.fifo (grouped by merchantId so related refunds process in order). The Lambda service polls the queue and hands the function up to 10 messages per invocation; the function issues them through the gateway, and successfully processed messages are deleted. Failures go to refund-jobs-dlq.fifo after 3 attempts (the DLQ of a FIFO queue must itself be FIFO); an operator inspects those manually.

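# Lambda handler consuming an SQS batch
# (requires ReportBatchItemFailures on the event source mapping)
import json
import logging

log = logging.getLogger()

# issue_refund is assumed to be defined elsewhere in this module.

def handler(event, context):
    records = event['Records']
    failures = []

    for index, record in enumerate(records):
        try:
            job = json.loads(record['body'])
            issue_refund(job)
        except Exception:
            log.exception('failed %s', record['messageId'])
            # FIFO queue: stop at the first failure and hand back the failed
            # message plus everything after it, so ordering is preserved
            failures = [{'itemIdentifier': r['messageId']} for r in records[index:]]
            break

    # partial batch response, only the listed messages go back to the queue
    return {'batchItemFailures': failures}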

SNS

SNS is a pub/sub topic. Publishers send messages to the topic; every subscriber gets a copy. Subscribers can be Lambda functions, SQS queues, HTTP(S) endpoints, email addresses, SMS numbers, mobile push endpoints, or Kinesis Data Firehose streams.

SQS has one logical consumer per message — messages are processed once. SNS has many subscribers; every one gets its own copy. The common pattern: SNS → multiple SQS queues, one per downstream team. Each team consumes at its own pace from its own queue.

Subscribers can specify filter policies (JSON) that match against message attributes. Only messages matching the filter are delivered. This keeps teams from writing their own “is this my message?” logic in every handler.

The payment-events topic broadcasts status changes — AUTHORIZED, CAPTURED, REFUNDED, DISPUTED. Subscribers include the ledger Lambda (all events), the fraud detector (AUTHORIZED only), the analytics pipeline's SQS queue (all events), and the ops Slack integration (DISPUTED only, so on-call hears about chargebacks).

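# Publishing an event
import json
import boto3

sns = boto3.client('sns')

# transaction_id is assumed to come from the surrounding handler
sns.publish(
    TopicArn='arn:aws:sns:...:payment-events',
    Message=json.dumps({
        'transactionId': transaction_id,
        'previous': 'PENDING',
        'now': 'AUTHORIZED',
    }),
    # filter policies match on these attributes, not on the message body
    MessageAttributes={
        'eventType':  {'DataType': 'String', 'StringValue': 'AUTHORIZED'},
        'amountTier': {'DataType': 'String', 'StringValue': 'HIGH'},
    },
)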

# Subscriber filter policy (DISPUTED only, any amount tier except LOW)
{
  "eventType":  ["DISPUTED"],
  "amountTier": [{ "anything-but": "LOW" }]
}
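
To see how such a policy is evaluated, here is a deliberately simplified matcher covering only the two operators used above: exact string match and anything-but. Real SNS filter policies support more operators (prefix, numeric ranges, exists, and so on), so treat this as a sketch of the semantics rather than a reimplementation; matches and its helper are hypothetical names, and attributes are reduced to a plain name-to-string dict.

```python
def _condition_matches(condition, value: str) -> bool:
    """One entry of a policy list: a literal string or an anything-but object."""
    if isinstance(condition, dict) and "anything-but" in condition:
        excluded = condition["anything-but"]
        if not isinstance(excluded, list):
            excluded = [excluded]
        return value not in excluded
    return condition == value


def matches(policy: dict, attributes: dict) -> bool:
    """Every policy key must be present and match at least one of its conditions."""
    for name, conditions in policy.items():
        if name not in attributes:
            return False
        if not any(_condition_matches(c, attributes[name]) for c in conditions):
            return False
    return True


policy = {"eventType": ["DISPUTED"], "amountTier": [{"anything-but": "LOW"}]}
```

Keys are ANDed together and the values within a key are ORed, which is why a DISPUTED event on a LOW-tier transaction is dropped while the same event on a HIGH-tier one is delivered.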

SES

SES sends email — transactional (receipts, notifications, password resets) and bulk (marketing, newsletters). You verify the sender domain, set up DKIM/SPF/DMARC, and call SendEmail or SendTemplatedEmail from anywhere: Lambda, Step Functions, an EC2 cron.

New SES accounts start in sandbox mode: you can only send to verified addresses, and rate is capped. You request production access through a support ticket; AWS checks that you have a legitimate use case and that the account won't be used to send spam. Plan for this early; don't discover it the day before launch.

SES supports Handlebars-style templates stored server-side. You call SendTemplatedEmail with the template name and a data blob; SES renders and sends. Keeps email copy out of your Lambda code and makes localisation feasible.

SES can publish bounces and complaints to an SNS topic once you configure notifications or a configuration-set event destination. Subscribe to it and suppress bad addresses: sustained bounce rates above 5% or complaint rates above 0.1% get your account put under review.
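
Those two thresholds are easy to monitor yourself from the counts your bounce-handling Lambda already sees. A small sketch using integer arithmetic to avoid rounding surprises; the function name and flag strings are hypothetical, and the 5% and 0.1% limits are the ones quoted above.

```python
def reputation_flags(sent: int, bounces: int, complaints: int) -> list[str]:
    """Return which review thresholds a sending window exceeds."""
    if sent == 0:
        return []
    flags = []
    if bounces * 100 > sent * 5:      # bounce rate above 5%
        flags.append("bounce-rate")
    if complaints * 1000 > sent:      # complaint rate above 0.1%
        flags.append("complaint-rate")
    return flags
```

Alarming on a non-empty result well before the limits are sustained gives you time to clean the recipient list.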

The last step of a successful checkout is an SES SendTemplatedEmail call with template receipt-v3, rendering the order summary, line items, total, and a link to the order page. The template is localised by a locale field (en-US / es-MX). Bounces go to an SNS topic consumed by a Lambda that flags the delivery in DynamoDB and escalates repeated bounces to customer success.

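import json
import boto3

# SES v2 client: templated sends go through send_email with a Template
# content block (the v1 API's equivalent is SendTemplatedEmail).
ses = boto3.client('sesv2')

# customer and order are assumed to be loaded earlier in the handler
ses.send_email(
    FromEmailAddress='no-reply@payments.example',
    Destination={'ToAddresses': [customer.email]},
    Content={
        'Template': {
            'TemplateName': 'receipt-v3',
            'TemplateData': json.dumps({
                'firstName':  customer.first_name,
                'orderId':    order.id,
                'total':      order.total,
                'receiptUrl': f'https://payments.example/orders/{order.id}',
                'locale':     customer.locale,
            }),
        },
    },
    ConfigurationSetName='payments-receipts',  # enables bounce -> SNS
)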

End to end: a payment flows through the lot

Every service above, in one story. This is what you build once you understand each piece.

Customer clicks “Pay” at checkout. Request hits API Gateway, which invokes a start Lambda that calls StartExecution on the Step Functions workflow. The API returns a tracking ID immediately.
Validate intent. Lambda reads payment-intents from DynamoDB (pk = paymentIntentId). If the intent is in PENDING status and not expired, proceed. Otherwise, fail the workflow with a clear error.
Fraud check. Lambda calls the fraud service synchronously with the customer, amount, device fingerprint, and IP. A high score short-circuits the flow into a clean decline; a borderline score triggers a 3DS step-up handled in a sub-state.
Parallel authorize and record. One branch authorizes the card with the upstream gateway; another writes a pending entry to the ledger. Each has its own retry/backoff. If either branch fails after max retries, the catch handler voids the auth and rolls the ledger row back.
Publish event. On success, publish an AUTHORIZED event to the SNS topic payment-events. The ledger reconciler, the fraud feedback loop, and the analytics pipeline all get the event in parallel, with no coupling to each other.
Send receipt. The final state sends the receipt via SES: template receipt-v3, order summary, total, and a link to the order page, localised by customer locale. An audit record is also written to S3 with full workflow history for PCI-DSS retention.

The state machine is durable, each step retries independently, and the whole flow runs in 5–30 seconds end-to-end — or under a second for the synchronous authorize path when no asynchronous steps are needed.
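
The first step of the story, the start Lambda, might look roughly like this. It is a sketch under assumptions: the tracking ID is a fresh UUID reused as the execution name, the state machine input carries only the payment intent ID, and the Step Functions client is injected (in production, boto3.client("stepfunctions")) so the handler can be tested without AWS. make_start_handler is a hypothetical name.

```python
import json
import uuid


def make_start_handler(sfn, state_machine_arn: str):
    """Build the API Gateway handler that kicks off the checkout workflow."""
    def handler(event, context):
        body = json.loads(event.get("body") or "{}")
        payment_intent_id = body.get("paymentIntentId")
        if not payment_intent_id:
            return {"statusCode": 400, "body": "missing paymentIntentId"}

        tracking_id = str(uuid.uuid4())
        # The workflow runs asynchronously; we only wait for it to be accepted.
        sfn.start_execution(
            stateMachineArn=state_machine_arn,
            name=tracking_id,
            input=json.dumps({
                "paymentIntentId": payment_intent_id,
                "trackingId": tracking_id,
            }),
        )
        return {"statusCode": 202, "body": json.dumps({"trackingId": tracking_id})}

    return handler
```

Returning 202 with the tracking ID lets the client poll or subscribe for the outcome while the state machine works through the remaining steps.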