BIRCH Backend Production Deployment Runbook

Last updated: 2026-06-17

This document captures the production deployment findings from the BIRCH agent document API enhancement rollout. It is intended for Iron Horse agents/operators so future work starts from the actual production topology instead of assumptions.

Executive summary

The BIRCH backend code changes were not the main problem. The deployment path had drifted from reality.

Key finding:

birchbackend.ihrailsoftware.com → AWS Application Load Balancer → EC2 target group → Docker container

The public DNS IPs belonged to the ALB, not to the live backend instance, and production was not running the ECR image that GitHub Actions was pushing. Production was running a Docker Hub image on a healthy ALB target.

The API enhancement is now live through a blue/green-style cutover:

Live ALB target: i-00f8b6f49c1e4edac:5056
New container:    birch_backend_candidate
Image:            018772930825.dkr.ecr.us-east-2.amazonaws.com/birch_backend:latest

The old container remains running as rollback:

Rollback container: birch_backend
Old image:          mithuniht/birch_backend:latest
Old port:           5055

Production endpoint

https://birchbackend.ihrailsoftware.com

Agent document API:

POST https://birchbackend.ihrailsoftware.com/agent/workorder-documents

Existing health/data route used by the ALB:

GET https://birchbackend.ihrailsoftware.com/get_all_common_data

Current verified production behavior

Verified after cutover:

GET /get_all_common_data → 200
POST /agent/workorder-documents without auth → 401 Unauthorized
POST /agent/workorder-documents with valid token → 200 document response

ARR-500B returns plain text. BRC, invoice, BBOM, and work-order return PDFs. A request with bill_to set to all can return a zip.

AWS account and region

AWS account: 018772930825
Region:      us-east-2

Hopper has approved admin/operator access for Iron Horse AWS work. The local key path is documented in Hopper memory; never paste or print AWS secrets in chat, task logs, commits, or docs.

Actual production AWS topology

Load balancer

Name: birchbackend-lb
Type: Application Load Balancer
DNS:  birchbackend-lb-2146404271.us-east-2.elb.amazonaws.com

The public hostname birchbackend.ihrailsoftware.com resolves to the ALB, not directly to EC2 instance public IPs.

Target group

Name: birchbackend-tg
ARN:  arn:aws:elasticloadbalancing:us-east-2:018772930825:targetgroup/birchbackend-tg/7699030c55ef2088
Health check path: /get_all_common_data
Original health settings:
  interval: 300 seconds
  timeout: 5 seconds
  healthy threshold: 5
  unhealthy threshold: 2

Current target state after cutover

i-00f8b6f49c1e4edac:5056 → healthy, active production target
i-00f8b6f49c1e4edac:5055 → old target, deregistered/draining after cutover
i-024cdbf3b2bcc7d66:5055 → pre-existing unhealthy target, not touched during rollout

Current live EC2 target

Instance ID: i-00f8b6f49c1e4edac
Name:        backup_main_server
Private IP:  172.31.2.27
Public IP:   3.133.116.32
SSM:         Online

Wrong / misleading instances encountered

These were not the live BIRCH backend ALB target during the rollout:

i-0eda428e451ace506
i-0a0120c318cf20f68

i-0a0120c318cf20f68 was SSM Online and running, but it was not serving the BIRCH backend. It was running other containers such as Trackage/ReaderAdmin.

Do not assume an EC2 instance is the BIRCH backend target just because it is running or SSM Online. Always inspect the ALB target group.

Current Docker state on live backend host

On i-00f8b6f49c1e4edac:

Production container:
  name:  birch_backend_candidate
  image: 018772930825.dkr.ecr.us-east-2.amazonaws.com/birch_backend:latest
  port:  5056 host → 5055 container
  ALB:   registered and healthy

Rollback container:
  name:  birch_backend
  image: mithuniht/birch_backend:latest
  port:  5055 host → 5055 container
  ALB:   deregistered/draining, container intentionally left running

The old container was deliberately left up during cutover to avoid downtime and preserve rollback.

Agent document API secret

The production bearer token is stored in AWS Systems Manager Parameter Store:

/birch/prod/agent-document-api-key

Type:

SecureString

Retrieve only when needed, and do not print the value:

BIRCH_DOC_TOKEN=$(aws ssm get-parameter \
  --region us-east-2 \
  --name /birch/prod/agent-document-api-key \
  --with-decryption \
  --query Parameter.Value \
  --output text)

The running production candidate container has this value injected as:

BIRCH_AGENT_DOCUMENT_API_KEY
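
A small guard can confirm that the retrieval above worked without echoing the secret. This is only a sketch; the function name require_doc_token is hypothetical and not part of any existing BIRCH tooling.

```shell
# Hypothetical helper: fail early if the token variable is empty,
# without ever printing its value.
require_doc_token() {
  if [ -z "${BIRCH_DOC_TOKEN:-}" ]; then
    echo "BIRCH_DOC_TOKEN is empty; check SSM access and the parameter name" >&2
    return 1
  fi
  # Report only the length so logs never contain the secret itself.
  echo "token loaded (${#BIRCH_DOC_TOKEN} characters)"
}

# Usage: require_doc_token || exit 1
# When finished, clear the value from the shell: unset BIRCH_DOC_TOKEN
```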

What went wrong during deployment

1. GitHub Actions was not aligned with real production

The workflow could build, scan, and push the Docker image to ECR, but production was actually running a Docker Hub image:

mithuniht/birch_backend:latest

The new CI image was pushed to:

018772930825.dkr.ecr.us-east-2.amazonaws.com/birch_backend:latest

Pushing to ECR alone did not update production.

2. SSH deploy from GitHub Actions was not viable

GitHub-hosted runners could not SSH into the backend host. Connections to port 22 timed out or were refused, depending on the target. This was a network/access problem, not a build problem.

3. Initial SSM target assumptions were wrong

Some supplied/guessed instances were running or SSM Online, but not in the BIRCH backend ALB target group. The correct source of truth is the ALB target group, not DNS IPs or instance names.

4. Public DNS pointed to ALB IPs

birchbackend.ihrailsoftware.com resolved to ALB IPs. Searching EC2 instances by those public IPs returned nothing, because they are load balancer addresses, not instance addresses.
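
One way to confirm this quickly is to compare the A records of the public hostname with those of the ALB DNS name. The sketch below assumes dig is installed on the operator machine; the function name resolves_to_alb is hypothetical. ALB addresses rotate and resolvers cache, so a mismatch is a hint to investigate rather than proof.

```shell
# Hypothetical helper: succeed when a hostname and the ALB DNS name
# currently resolve to the same set of IPv4 addresses.
resolves_to_alb() {
  # Keep only lines that look like IPv4 addresses (drops CNAME targets).
  host_ips=$(dig +short A "$1" | grep -E '^[0-9.]+$' | sort)
  alb_ips=$(dig +short A "$2" | grep -E '^[0-9.]+$' | sort)
  [ -n "$host_ips" ] && [ "$host_ips" = "$alb_ips" ]
}

# Usage:
#   resolves_to_alb birchbackend.ihrailsoftware.com \
#     birchbackend-lb-2146404271.us-east-2.elb.amazonaws.com \
#     && echo "hostname fronts the ALB"
```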

5. Backend host did not have AWS CLI

The live backend host had Docker but no AWS CLI, so a workflow step that expected the instance to run:

aws ecr get-login-password ...

would not work as written.

The workaround was to generate the ECR login token from Hopper using approved AWS credentials and pass it through SSM for docker login.

6. Runtime API key was missing until created

The new route intentionally fails closed if BIRCH_AGENT_DOCUMENT_API_KEY is missing.

Observed candidate behavior before secret injection:

POST /agent/workorder-documents → 503 BIRCH_AGENT_DOCUMENT_API_KEY is not configured

After creating /birch/prod/agent-document-api-key and injecting it into the candidate container:

unauthenticated request → 401 Unauthorized
authenticated request   → 200 document response

7. A stale unhealthy target already existed

The target group already contained:

i-024cdbf3b2bcc7d66:5055 → unhealthy / Target.Timeout

This was pre-existing and was not modified during the rollout.

Branch and promotion policy

BIRCH development is governed by this mandate:

staging → feature/work branch → staging merge → staging testing → approved-hours production promotion

Rules:

Start every BIRCH development task from the current staging branch.
Create a new feature/work branch from staging for the change.
Merge completed work back into staging for testing.
Do not merge to production/main until staging testing is complete.
Production promotion happens only during Iron Horse approved hours.
Production/client-impacting promotion requires Derek or responsible-operator approval.
Never develop from main / master / production.
Never push implementation changes directly to main / master / production.
Emergency rollback or hotfix exceptions require explicit approval and must be documented afterward.
Every enhancement should update this knowledgebase before or with production promotion.

Correct future deployment sequence

Use this sequence for future BIRCH backend production deploys.

1. Identify current ALB target

aws elbv2 describe-target-health \
  --region us-east-2 \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-2:018772930825:targetgroup/birchbackend-tg/7699030c55ef2088

Confirm which target is healthy and serving traffic.
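
To reduce the output to just the healthy targets, the same call can be filtered. This is a minimal sketch: the function name healthy_targets is hypothetical, and it assumes the AWS CLI is configured with the approved credentials.

```shell
TG_ARN="arn:aws:elasticloadbalancing:us-east-2:018772930825:targetgroup/birchbackend-tg/7699030c55ef2088"

# Print one "instance-id:port" line per target whose state is "healthy".
healthy_targets() {
  aws elbv2 describe-target-health \
    --region us-east-2 \
    --target-group-arn "$TG_ARN" \
    --query 'TargetHealthDescriptions[].[Target.Id,Target.Port,TargetHealth.State]' \
    --output text |
    awk '$3 == "healthy" { print $1 ":" $2 }'
}

# Usage: healthy_targets
```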

2. Inspect the current live container via SSM

Use SSM on the healthy target instance. Do not stop anything during inspection.

Useful read-only commands:

sudo docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}'
curl -sS -o /tmp/probe -w '%{http_code}\n' http://127.0.0.1:<port>/get_all_common_data
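
Since SSH is not available, these commands have to be sent through SSM. The sketch below wraps the send, wait, and fetch steps using the standard AWS-RunShellScript document. The function name ssm_run is hypothetical, and it assumes the remote command contains no double quotes or backslashes, because the command is embedded in a JSON string without escaping.

```shell
# Hypothetical helper: run one read-only shell command on an instance
# through SSM and print its standard output.
ssm_run() {
  instance_id=$1
  remote_cmd=$2
  # Send the command and capture the SSM command ID.
  command_id=$(aws ssm send-command \
    --region us-east-2 \
    --instance-ids "$instance_id" \
    --document-name AWS-RunShellScript \
    --parameters "{\"commands\":[\"$remote_cmd\"]}" \
    --query Command.CommandId \
    --output text) || return 1
  # Block until the invocation finishes; the waiter fails if it does not succeed.
  aws ssm wait command-executed \
    --region us-east-2 \
    --command-id "$command_id" \
    --instance-id "$instance_id" || return 1
  # Fetch what the command printed on the instance.
  aws ssm get-command-invocation \
    --region us-east-2 \
    --command-id "$command_id" \
    --instance-id "$instance_id" \
    --query StandardOutputContent \
    --output text
}

# Usage: ssm_run i-00f8b6f49c1e4edac 'sudo docker ps'
```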

3. Stage new container beside live

Do not replace the current container first.

Recommended staging pattern:

current live host port: 5055
candidate host port:   5056 or another unused port

Run candidate with:

NODE_ENV=production
BIRCH_AGENT_DOCUMENT_API_KEY=<from SSM SecureString>
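
A possible shape for the run command is sketched below. The function name run_candidate and its arguments are hypothetical. The only flags shown are those this runbook documents (container port 5055 and the two environment variables); any other flags the live container uses are not recorded here and should be copied from docker inspect on the current container.

```shell
# Hypothetical helper: start a candidate container beside the live one.
# Arguments: container name, unused host port, optional image.
# BIRCH_AGENT_DOCUMENT_API_KEY is passed by name (-e VAR with no value), so
# Docker copies it from the calling environment and the secret does not
# appear on the command line. The variable must be exported beforehand.
run_candidate() {
  name=$1
  host_port=$2
  image=${3:-018772930825.dkr.ecr.us-east-2.amazonaws.com/birch_backend:latest}
  if [ -z "${BIRCH_AGENT_DOCUMENT_API_KEY:-}" ]; then
    echo "BIRCH_AGENT_DOCUMENT_API_KEY is not set" >&2
    return 1
  fi
  docker run -d \
    --name "$name" \
    -p "$host_port:5055" \
    -e NODE_ENV=production \
    -e BIRCH_AGENT_DOCUMENT_API_KEY \
    "$image"
}

# Usage: run_candidate birch_backend_candidate 5056
```

If docker has to be invoked through sudo, note that sudo normally resets the environment, so the variable would need to be preserved explicitly for the pass-by-name form to work.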

4. Verify candidate locally

Before touching ALB:

curl -sS -o /tmp/common -w '%{http_code}\n' http://127.0.0.1:5056/get_all_common_data

Expected: 200

Unauthorized API check:

curl -sS -o /tmp/noauth -w '%{http_code}\n' \
  -X POST http://127.0.0.1:5056/agent/workorder-documents \
  -H 'content-type: application/json' \
  --data '{"work_order":"WO_05926","document_type":"arr-500b","bill_to":"owner"}'

Expected: 401 Unauthorized

Authenticated API check should return 200 and document bytes.
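
The authenticated check can follow the same pattern as the unauthenticated one. This sketch assumes the token is sent as an Authorization: Bearer header and that BIRCH_DOC_TOKEN is already set in the shell where it runs; confirm the header format against docs/agent-document-api.md. The function name doc_api_auth_status is hypothetical.

```shell
# Hypothetical helper: POST the sample request with a bearer token and print
# only the HTTP status code. The response body is written to /tmp/auth.
# Assumption: the API expects "Authorization: Bearer <token>".
doc_api_auth_status() {
  base_url=$1
  curl -sS -o /tmp/auth -w '%{http_code}\n' \
    -X POST "$base_url/agent/workorder-documents" \
    -H 'content-type: application/json' \
    -H "Authorization: Bearer ${BIRCH_DOC_TOKEN}" \
    --data '{"work_order":"WO_05926","document_type":"arr-500b","bill_to":"owner"}'
}

# Usage: [ "$(doc_api_auth_status http://127.0.0.1:5056)" = "200" ] && echo ok
```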

5. Confirm security group allows ALB to candidate port

If using a new host port, ensure the instance security group allows ALB security groups to reach it.

During rollout, narrow ALB-source rules were added for port 5056 on:

Instance SG: sg-02ca172f50d836137
ALB SGs:
  sg-0bf2bd978762a56e1
  sg-08bb6d397658848ca
  sg-0a0c8958c85f7fe63

There was also an existing broad 0.0.0.0/0 rule for 5056 on another attached SG. Prefer narrow ALB-source rules for future cleanup/hardening.
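
Narrow ALB-source rules of this kind can be added with authorize-security-group-ingress, one call per ALB security group. The sketch below is an illustration, not a record of the exact commands run during the rollout; the function name allow_alb_to_port is hypothetical.

```shell
# Hypothetical helper: allow each listed ALB security group to reach one TCP
# port on the instance security group. It stops at the first failure, which
# includes the InvalidPermission.Duplicate error returned when a rule
# already exists.
allow_alb_to_port() {
  instance_sg=$1
  port=$2
  shift 2
  for alb_sg in "$@"; do
    aws ec2 authorize-security-group-ingress \
      --region us-east-2 \
      --group-id "$instance_sg" \
      --protocol tcp \
      --port "$port" \
      --source-group "$alb_sg" || return 1
  done
}

# Usage:
#   allow_alb_to_port sg-02ca172f50d836137 5057 \
#     sg-0bf2bd978762a56e1 sg-08bb6d397658848ca sg-0a0c8958c85f7fe63
```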

6. Register candidate target in ALB

aws elbv2 register-targets \
  --region us-east-2 \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-2:018772930825:targetgroup/birchbackend-tg/7699030c55ef2088 \
  --targets Id=i-00f8b6f49c1e4edac,Port=5056

Wait until candidate is healthy.

Optional temporary health check acceleration is allowed, but restore original values afterward.
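
With the original settings (5 consecutive passes, 300 seconds apart), a new target can need on the order of 5 × 300 = 1500 seconds, roughly 25 minutes, to turn healthy. The aws elbv2 wait target-in-service waiter, by default, polls every 15 seconds for 40 attempts (about 10 minutes), so it may give up first. The sketch below shortens the check, waits, and always restores the original values. The fast values (30-second interval, threshold 2) and the function name wait_healthy_fast are illustrative assumptions. The change applies to every target in the group while it is in effect.

```shell
TG_ARN="arn:aws:elasticloadbalancing:us-east-2:018772930825:targetgroup/birchbackend-tg/7699030c55ef2088"

# Hypothetical helper: speed up health checks, wait for one target to become
# healthy, then restore the original settings recorded in this runbook.
wait_healthy_fast() {
  target=$1   # for example: Id=i-00f8b6f49c1e4edac,Port=5056
  aws elbv2 modify-target-group --region us-east-2 \
    --target-group-arn "$TG_ARN" \
    --health-check-interval-seconds 30 \
    --healthy-threshold-count 2 || return 1
  wait_status=0
  aws elbv2 wait target-in-service --region us-east-2 \
    --target-group-arn "$TG_ARN" \
    --targets "$target" || wait_status=$?
  # Restore the original values even if the wait failed.
  aws elbv2 modify-target-group --region us-east-2 \
    --target-group-arn "$TG_ARN" \
    --health-check-interval-seconds 300 \
    --healthy-threshold-count 5 || return 1
  return "$wait_status"
}

# Usage: wait_healthy_fast Id=i-00f8b6f49c1e4edac,Port=5056
```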

7. Cut over only after candidate is healthy

Once candidate is healthy, deregister the old target from ALB:

aws elbv2 deregister-targets \
  --region us-east-2 \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-2:018772930825:targetgroup/birchbackend-tg/7699030c55ef2088 \
  --targets Id=i-00f8b6f49c1e4edac,Port=5055

Do not stop the old container immediately. Leave it running for rollback until production has been observed stable.

8. Verify public production

curl -fsS https://birchbackend.ihrailsoftware.com/get_all_common_data >/tmp/common.json

Document API checks:

No auth → 401
Valid auth → 200 document response
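
The unauthenticated checks can be bundled into one small smoke test. This is a sketch; the function names http_status and smoke_test are hypothetical, and the request body reuses the sample work order from the local check.

```shell
# Hypothetical helper: print the HTTP status for a request; all arguments
# are passed straight to curl.
http_status() {
  curl -sS -o /dev/null -w '%{http_code}' "$@"
}

# Check the two unauthenticated expectations from this runbook:
# the health route returns 200 and the document API rejects with 401.
smoke_test() {
  base_url=${1:-https://birchbackend.ihrailsoftware.com}
  get_code=$(http_status "$base_url/get_all_common_data")
  post_code=$(http_status -X POST "$base_url/agent/workorder-documents" \
    -H 'content-type: application/json' \
    --data '{"work_order":"WO_05926","document_type":"arr-500b","bill_to":"owner"}')
  echo "GET /get_all_common_data: $get_code"
  echo "POST /agent/workorder-documents (no auth): $post_code"
  [ "$get_code" = "200" ] && [ "$post_code" = "401" ]
}

# Usage: smoke_test || echo "public verification failed" >&2
```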

Rollback procedure

If the new target on 5056 fails after cutover:

Re-register the old target:

aws elbv2 register-targets \
  --region us-east-2 \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-2:018772930825:targetgroup/birchbackend-tg/7699030c55ef2088 \
  --targets Id=i-00f8b6f49c1e4edac,Port=5055

Wait for i-00f8b6f49c1e4edac:5055 to become healthy.

Deregister the new target:

aws elbv2 deregister-targets \
  --region us-east-2 \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-2:018772930825:targetgroup/birchbackend-tg/7699030c55ef2088 \
  --targets Id=i-00f8b6f49c1e4edac,Port=5056

Keep both containers until traffic is stable and logs are reviewed.
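
The rollback steps can be chained so the new target is never removed before the old one is healthy. This is a sketch under the current layout (old on 5055, new on 5056); the function name rollback_to_old is hypothetical. The default waiter polls for roughly ten minutes, which may be shorter than the time the original health check settings need, so a waiter timeout does not by itself mean the old target is broken.

```shell
TG_ARN="arn:aws:elasticloadbalancing:us-east-2:018772930825:targetgroup/birchbackend-tg/7699030c55ef2088"
INSTANCE="i-00f8b6f49c1e4edac"

# Hypothetical helper: re-register the old target, wait for it to pass
# health checks, and only then deregister the new target.
rollback_to_old() {
  aws elbv2 register-targets --region us-east-2 \
    --target-group-arn "$TG_ARN" \
    --targets "Id=$INSTANCE,Port=5055" || return 1
  # Stop here if the old target does not become healthy.
  aws elbv2 wait target-in-service --region us-east-2 \
    --target-group-arn "$TG_ARN" \
    --targets "Id=$INSTANCE,Port=5055" || return 1
  aws elbv2 deregister-targets --region us-east-2 \
    --target-group-arn "$TG_ARN" \
    --targets "Id=$INSTANCE,Port=5056"
}

# Usage: rollback_to_old
```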

Cleanup / follow-up recommendations

Update GitHub Actions deploy workflow to target the real ALB/instance path:
- target i-00f8b6f49c1e4edac unless architecture changes
- inject BIRCH_AGENT_DOCUMENT_API_KEY from Parameter Store
- do not assume AWS CLI exists on the instance, or install/manage it explicitly
- use SSM from GitHub with known-good AWS credentials

Decide whether production should standardize on ECR or Docker Hub:
- Current new deployment uses ECR.
- Old production used Docker Hub.
- Avoid split-brain image sources.

Clean up stale/unhealthy ALB target if confirmed unused:

i-024cdbf3b2bcc7d66:5055

Decide when to stop/remove rollback container:

birch_backend on port 5055

Do not remove it until Derek or the responsible operator approves.

Review security group exposure on port 5056:
- narrow ALB-source rules were added
- a broader 0.0.0.0/0 rule for 5056 already existed on another attached SG
- consider removing the broad rule if not required elsewhere

Rename the production container once stable, if desired:
- current name birch_backend_candidate is accurate for the rollout but awkward for long-term operations
- if renaming, use blue/green procedure again to avoid downtime

Related docs

Agent/API usage instructions:

docs/agent-document-api-agent-instructions.md

API reference:

docs/agent-document-api.md
