BIRCH Backend Production Deployment Runbook
Last updated: 2026-06-17
This document captures the production deployment findings from the BIRCH agent document API enhancement rollout. It is intended for Iron Horse agents/operators so future work starts from the actual production topology instead of assumptions.
Executive summary
The BIRCH backend code changes were not the main problem. The deployment path had drifted from reality.
Key finding:
birchbackend.ihrailsoftware.com → AWS Application Load Balancer → EC2 target group → Docker container
The live production backend was not directly represented by the public DNS IPs, and it was not running from the ECR image that GitHub Actions was pushing. Production was running a Docker Hub image on a healthy ALB target.
The API enhancement is now live through a blue/green-style cutover:
Live ALB target: i-00f8b6f49c1e4edac:5056
New container: birch_backend_candidate
Image: 018772930825.dkr.ecr.us-east-2.amazonaws.com/birch_backend:latest
The old container remains running as rollback:
Rollback container: birch_backend
Old image: mithuniht/birch_backend:latest
Old port: 5055
Production endpoint
https://birchbackend.ihrailsoftware.com
Agent document API:
POST https://birchbackend.ihrailsoftware.com/agent/workorder-documents
Existing health/data route used by the ALB:
GET https://birchbackend.ihrailsoftware.com/get_all_common_data
Current verified production behavior
Verified after cutover:
GET /get_all_common_data → 200
POST /agent/workorder-documents without auth → 401 Unauthorized
POST /agent/workorder-documents with valid token → 200 document response
ARR-500B returns plain text. BRC, invoice, BBOM, and work-order return PDFs. A request with bill_to set to all can return a zip archive.
AWS account and region
AWS account: 018772930825
Region: us-east-2
Hopper has approved admin/operator access for Iron Horse AWS work. The local key path is documented in Hopper memory; never paste or print AWS secrets in chat, task logs, commits, or docs.
Actual production AWS topology
Load balancer
Name: birchbackend-lb
Type: Application Load Balancer
DNS: birchbackend-lb-2146404271.us-east-2.elb.amazonaws.com
The public hostname birchbackend.ihrailsoftware.com resolves to the ALB, not directly to EC2 instance public IPs.
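One way to confirm this before a deploy is to compare the A records of the public hostname with those of the ALB DNS name. The following is a minimal sketch, assuming dig is installed on the operator machine; the function name resolves_to_alb is illustrative and not part of any existing tooling. ALB addresses rotate, so a single mismatch is a prompt to re-check, not proof of a problem.

```shell
# Succeeds if every A record of the public hostname ($1) is also an
# A record of the ALB DNS name ($2). Assumes `dig` is available.
resolves_to_alb() {
  host_ips=$(dig +short "$1" A | grep -E '^[0-9]+(\.[0-9]+){3}$' | sort -u)
  alb_ips=$(dig +short "$2" A | grep -E '^[0-9]+(\.[0-9]+){3}$' | sort -u)
  # No address at all for the hostname counts as a failure.
  [ -n "$host_ips" ] || return 1
  for ip in $host_ips; do
    printf '%s\n' "$alb_ips" | grep -qxF "$ip" || return 1
  done
  return 0
}

# Example:
# resolves_to_alb birchbackend.ihrailsoftware.com \
#   birchbackend-lb-2146404271.us-east-2.elb.amazonaws.com && echo "points at ALB"
```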
Target group
Name: birchbackend-tg
ARN: arn:aws:elasticloadbalancing:us-east-2:018772930825:targetgroup/birchbackend-tg/7699030c55ef2088
Health check path: /get_all_common_data
Original health settings:
interval: 300 seconds
timeout: 5 seconds
healthy threshold: 5
unhealthy threshold: 2
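These settings determine how long a newly registered target waits before it receives traffic. As a rough worked example (an approximation, since the first check runs shortly after registration), the wait is about the interval multiplied by the threshold. The helper name below is illustrative.

```shell
# Approximate seconds until a target changes state: it needs `threshold`
# consecutive check results, with one check per `interval` seconds.
seconds_to_state_change() {
  echo $(( $1 * $2 ))
}

# Original settings in this runbook:
# seconds_to_state_change 300 5   # healthy:   1500 s, about 25 minutes
# seconds_to_state_change 300 2   # unhealthy:  600 s, about 10 minutes
```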
Current target state after cutover
i-00f8b6f49c1e4edac:5056 → healthy, active production target
i-00f8b6f49c1e4edac:5055 → old target, deregistered/draining after cutover
i-024cdbf3b2bcc7d66:5055 → pre-existing unhealthy target, not touched during rollout
Current live EC2 target
Instance ID: i-00f8b6f49c1e4edac
Name: backup_main_server
Private IP: 172.31.2.27
Public IP: 3.133.116.32
SSM: Online
Wrong / misleading instances encountered
These were not the live BIRCH backend ALB target during the rollout:
i-0eda428e451ace506
i-0a0120c318cf20f68
i-0a0120c318cf20f68 was SSM Online and running, but it was not serving the BIRCH backend. It was running other containers such as Trackage/ReaderAdmin.
Do not assume an EC2 instance is the BIRCH backend target just because it is running or SSM Online. Always inspect the ALB target group.
Current Docker state on live backend host
On i-00f8b6f49c1e4edac:
Production container:
name: birch_backend_candidate
image: 018772930825.dkr.ecr.us-east-2.amazonaws.com/birch_backend:latest
port: 5056 host → 5055 container
ALB: registered and healthy
Rollback container:
name: birch_backend
image: mithuniht/birch_backend:latest
port: 5055 host → 5055 container
ALB: deregistered/draining, container intentionally left running
The old container was deliberately left up during cutover to avoid downtime and preserve rollback.
Agent document API secret
The production bearer token is stored in AWS Systems Manager Parameter Store:
/birch/prod/agent-document-api-key
Type:
SecureString
Retrieve only when needed, and do not print the value:
BIRCH_DOC_TOKEN=$(aws ssm get-parameter \
--region us-east-2 \
--name /birch/prod/agent-document-api-key \
--with-decryption \
--query Parameter.Value \
--output text)
The running production candidate container has this value injected as:
BIRCH_AGENT_DOCUMENT_API_KEY
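To confirm the retrieval worked without exposing the secret, a small check can report only whether the variable is populated. This is a sketch; token_loaded is a hypothetical helper name.

```shell
# Succeeds only if BIRCH_DOC_TOKEN is set and non-empty.
# Prints the length, never the value.
token_loaded() {
  if [ -z "${BIRCH_DOC_TOKEN:-}" ]; then
    echo "BIRCH_DOC_TOKEN is empty or unset" >&2
    return 1
  fi
  echo "token loaded (${#BIRCH_DOC_TOKEN} characters)"
}
```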
What went wrong during deployment
1. GitHub Actions was not aligned with real production
The workflow could build, scan, and push the Docker image to ECR, but production was actually running a Docker Hub image:
mithuniht/birch_backend:latest
The new CI image was pushed to:
018772930825.dkr.ecr.us-east-2.amazonaws.com/birch_backend:latest
Pushing to ECR alone did not update production.
2. SSH deploy from GitHub Actions was not viable
GitHub-hosted runners could not SSH into the backend host: connections to port 22 timed out or were refused, depending on the target. This was a network/access problem, not a build problem.
3. Initial SSM target assumptions were wrong
Some supplied/guessed instances were running or SSM Online, but not in the BIRCH backend ALB target group. The correct source of truth is the ALB target group, not DNS IPs or instance names.
4. Public DNS pointed to ALB IPs
birchbackend.ihrailsoftware.com resolved to ALB IPs. Searching EC2 instances by those public IPs returned nothing, because they are load balancer addresses, not instance addresses.
5. Backend host did not have AWS CLI
The live backend host had Docker but no AWS CLI, so a workflow step that expected the instance to run:
aws ecr get-login-password ...
would not work as written.
The workaround was to generate the ECR login token from Hopper using approved AWS credentials and pass it through SSM for docker login.
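The workaround can be sketched roughly as follows. This is an illustration under stated assumptions, not a record of the exact commands used: the function names are hypothetical, and it assumes the standard AWS-RunShellScript SSM document. Note that the token ends up in SSM command history; ECR login tokens expire after 12 hours.

```shell
ECR_REGISTRY=018772930825.dkr.ecr.us-east-2.amazonaws.com

# Build the command the instance will run. docker login reads the token
# from stdin rather than from a command-line argument.
remote_login_command() {
  printf "echo '%s' | sudo docker login --username AWS --password-stdin %s" \
    "$1" "$ECR_REGISTRY"
}

# Generate the token on the operator machine and send the login command
# to the instance ($1) through SSM. Prints the SSM command ID.
send_ecr_login() {
  token=$(aws ecr get-login-password --region us-east-2) || return 1
  aws ssm send-command \
    --region us-east-2 \
    --instance-ids "$1" \
    --document-name AWS-RunShellScript \
    --parameters "{\"commands\":[\"$(remote_login_command "$token")\"]}" \
    --query Command.CommandId \
    --output text
}

# Example: send_ecr_login i-00f8b6f49c1e4edac
```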
6. Runtime API key was missing until created
The new route intentionally fails closed if BIRCH_AGENT_DOCUMENT_API_KEY is missing.
Observed candidate behavior before secret injection:
POST /agent/workorder-documents → 503 BIRCH_AGENT_DOCUMENT_API_KEY is not configured
After creating /birch/prod/agent-document-api-key and injecting it into the candidate container:
unauthenticated request → 401 Unauthorized
authenticated request → 200 document response
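The three observed statuses map to distinct causes, which makes the route easy to triage. A minimal sketch of that mapping, based only on the behavior recorded above (the helper name is illustrative):

```shell
# Map the HTTP status of POST /agent/workorder-documents to the likely
# cause, using the behavior observed during this rollout.
explain_doc_api_status() {
  case "$1" in
    200) echo "ok: token accepted, document returned" ;;
    401) echo "key is configured; request was not authenticated" ;;
    503) echo "BIRCH_AGENT_DOCUMENT_API_KEY is not configured in the container" ;;
    *)   echo "unexpected status $1: check container logs" ;;
  esac
}
```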
7. A stale unhealthy target already existed
The target group already contained:
i-024cdbf3b2bcc7d66:5055 → unhealthy / Target.Timeout
This was pre-existing and was not modified during the rollout.
Branch and promotion policy
BIRCH development is governed by this mandate:
staging → feature/work branch → staging merge → staging testing → approved-hours production promotion
Rules:
- Start every BIRCH development task from the current staging branch.
- Create a new feature/work branch from staging for the change.
- Merge completed work back into staging for testing.
- Do not merge to production/main until staging testing is complete.
- Production promotion happens only during Iron Horse approved hours.
- Production/client-impacting promotion requires Derek or responsible-operator approval.
- Never develop from main/master/production.
- Never push implementation changes directly to main/master/production.
- Emergency rollback or hotfix exceptions require explicit approval and must be documented afterward.
- Every enhancement should update this knowledgebase before or with production promotion.
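The first two rules can be sketched as a small guard around branch creation. This assumes the remote is named origin and git 2.23 or later (for git switch); both function names are hypothetical.

```shell
# Branches that must never be developed from or pushed to directly.
is_protected_branch() {
  case "$1" in
    main|master|production) return 0 ;;
    *) return 1 ;;
  esac
}

# Create a work branch ($1) from the current remote staging branch.
# --no-track keeps the new branch from tracking staging itself.
start_work_branch() {
  if is_protected_branch "$1" || [ "$1" = "staging" ]; then
    echo "refusing to use '$1' as a work branch name" >&2
    return 1
  fi
  git fetch origin staging &&
    git switch --no-track -c "$1" origin/staging
}

# Example: start_work_branch feature/doc-api-fix
```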
Correct future deployment sequence
Use this sequence for future BIRCH backend production deploys.
1. Identify current ALB target
aws elbv2 describe-target-health \
--region us-east-2 \
--target-group-arn arn:aws:elasticloadbalancing:us-east-2:018772930825:targetgroup/birchbackend-tg/7699030c55ef2088
Confirm which target is healthy and serving traffic.
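To reduce the output to just the healthy targets, the same call can be filtered with a JMESPath query. A minimal sketch; the function name is illustrative.

```shell
TG_ARN=arn:aws:elasticloadbalancing:us-east-2:018772930825:targetgroup/birchbackend-tg/7699030c55ef2088

# Print one "instance-id<TAB>port" line per healthy target.
healthy_targets() {
  aws elbv2 describe-target-health \
    --region us-east-2 \
    --target-group-arn "$TG_ARN" \
    --query "TargetHealthDescriptions[?TargetHealth.State=='healthy'].[Target.Id,Target.Port]" \
    --output text
}
```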
2. Inspect the current live container via SSM
Use SSM on the healthy target instance. Do not stop anything during inspection.
Useful read-only commands:
sudo docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}'
curl -sS -o /tmp/probe -w '%{http_code}\n' http://127.0.0.1:<port>/get_all_common_data
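From an operator machine, a read-only command like these can be run through SSM without opening a shell session. The sketch below assumes the standard AWS-RunShellScript document and uses a hypothetical function name; it runs only docker ps.

```shell
# Run `sudo docker ps` on an instance ($1) through SSM and print its output.
ssm_docker_ps() {
  command_id=$(aws ssm send-command \
    --region us-east-2 \
    --instance-ids "$1" \
    --document-name AWS-RunShellScript \
    --parameters 'commands=["sudo docker ps"]' \
    --query Command.CommandId \
    --output text) || return 1
  # Block until the command finishes; the waiter fails if the command fails.
  aws ssm wait command-executed \
    --region us-east-2 --command-id "$command_id" --instance-id "$1" || return 1
  aws ssm get-command-invocation \
    --region us-east-2 --command-id "$command_id" --instance-id "$1" \
    --query StandardOutputContent \
    --output text
}

# Example: ssm_docker_ps i-00f8b6f49c1e4edac
```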
3. Stage new container beside live
Do not replace the current container first.
Recommended staging pattern:
current live host port: 5055
candidate host port: 5056 or another unused port
Run candidate with:
NODE_ENV=production
BIRCH_AGENT_DOCUMENT_API_KEY=<from SSM SecureString>
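A candidate started this way might look like the sketch below. It assumes the key is already exported in the shell that runs it and that the caller can reach the Docker daemon without sudo (sudo resets the environment by default, which would drop the key). The function name is illustrative, and any restart policy or extra options used in production are not shown.

```shell
IMAGE=018772930825.dkr.ecr.us-east-2.amazonaws.com/birch_backend:latest

# Start the candidate beside the live container: host port 5056 maps to
# container port 5055. `-e NAME` with no value copies NAME from the calling
# environment, so the key does not appear on the command line.
run_candidate() {
  if [ -z "${BIRCH_AGENT_DOCUMENT_API_KEY:-}" ]; then
    echo "BIRCH_AGENT_DOCUMENT_API_KEY must be set first" >&2
    return 1
  fi
  export BIRCH_AGENT_DOCUMENT_API_KEY
  docker run -d \
    --name birch_backend_candidate \
    -p 5056:5055 \
    -e NODE_ENV=production \
    -e BIRCH_AGENT_DOCUMENT_API_KEY \
    "$IMAGE"
}
```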
4. Verify candidate locally
Before touching ALB:
curl -sS -o /tmp/common -w '%{http_code}\n' http://127.0.0.1:5056/get_all_common_data
Expected:
200
Unauthorized API check:
curl -sS -o /tmp/noauth -w '%{http_code}\n' \
-X POST http://127.0.0.1:5056/agent/workorder-documents \
-H 'content-type: application/json' \
--data '{"work_order":"WO_05926","document_type":"arr-500b","bill_to":"owner"}'
Expected:
401
Authenticated API check should return 200 and document bytes.
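The authenticated check can be sketched as follows. It assumes the token is sent as an Authorization: Bearer header (the runbook calls it a bearer token but does not show the header) and that BIRCH_DOC_TOKEN is already set in the shell. The token is passed through curl's config on stdin so that it stays out of the process list; the function name and output path are illustrative.

```shell
# POST to the document API at base URL $1 with the bearer token supplied
# via curl's config on stdin (-K -). Prints only the HTTP status; the
# response body is written to /tmp/auth_doc.
auth_check() {
  printf 'header = "Authorization: Bearer %s"\n' "$BIRCH_DOC_TOKEN" |
    curl -sS -K - -o /tmp/auth_doc -w '%{http_code}\n' \
      -X POST "$1/agent/workorder-documents" \
      -H 'content-type: application/json' \
      --data '{"work_order":"WO_05926","document_type":"arr-500b","bill_to":"owner"}'
}

# Example: auth_check http://127.0.0.1:5056
```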
5. Confirm security group allows ALB to candidate port
If using a new host port, ensure the instance security group allows ALB security groups to reach it.
During rollout, narrow ALB-source rules were added for port 5056 on:
Instance SG: sg-02ca172f50d836137
ALB SGs:
sg-0bf2bd978762a56e1
sg-08bb6d397658848ca
sg-0a0c8958c85f7fe63
There was also an existing broad 0.0.0.0/0 rule for 5056 on another attached SG. Prefer narrow ALB-source rules for future cleanup/hardening.
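For a future candidate port, narrow ALB-source rules of this kind could be added as sketched below. The function name is hypothetical. Re-running it for a port whose rules already exist (such as 5056) returns a duplicate-rule error from AWS.

```shell
# Allow each ALB security group to reach TCP port $1 on the instance SG.
allow_alb_to_port() {
  port="$1"
  for alb_sg in sg-0bf2bd978762a56e1 sg-08bb6d397658848ca sg-0a0c8958c85f7fe63; do
    aws ec2 authorize-security-group-ingress \
      --region us-east-2 \
      --group-id sg-02ca172f50d836137 \
      --ip-permissions "IpProtocol=tcp,FromPort=$port,ToPort=$port,UserIdGroupPairs=[{GroupId=$alb_sg}]" \
      || return 1
  done
}

# Example: allow_alb_to_port 5057
```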
6. Register candidate target in ALB
aws elbv2 register-targets \
--region us-east-2 \
--target-group-arn arn:aws:elasticloadbalancing:us-east-2:018772930825:targetgroup/birchbackend-tg/7699030c55ef2088 \
--targets Id=i-00f8b6f49c1e4edac,Port=5056
Wait until candidate is healthy.
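Temporarily shortening the health check interval or lowering the healthy threshold is allowed to speed this up, but restore the original values (interval 300 seconds, healthy threshold 5) afterward.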
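A sketch of changing and restoring the health check settings follows. The temporary values (30-second interval, threshold 2) are assumptions chosen for illustration, not values recorded from the rollout; the restore values are the original settings listed in this runbook. Function names are hypothetical.

```shell
TG_ARN=arn:aws:elasticloadbalancing:us-east-2:018772930825:targetgroup/birchbackend-tg/7699030c55ef2088

# Set the health check interval ($1, seconds) and healthy threshold ($2).
# Prints the interval now in effect.
set_health_check() {
  aws elbv2 modify-target-group \
    --region us-east-2 \
    --target-group-arn "$TG_ARN" \
    --health-check-interval-seconds "$1" \
    --healthy-threshold-count "$2" \
    --query 'TargetGroups[0].HealthCheckIntervalSeconds' \
    --output text
}

# Illustrative temporary values: 30 s interval, 2 passing checks.
accelerate_health_check() { set_health_check 30 2; }

# Original values: 300 s interval, 5 passing checks.
restore_health_check() { set_health_check 300 5; }
```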
7. Cut over only after candidate is healthy
Once candidate is healthy, deregister the old target from ALB:
aws elbv2 deregister-targets \
--region us-east-2 \
--target-group-arn arn:aws:elasticloadbalancing:us-east-2:018772930825:targetgroup/birchbackend-tg/7699030c55ef2088 \
--targets Id=i-00f8b6f49c1e4edac,Port=5055
Do not stop the old container immediately. Leave it running for rollback until production has been observed stable.
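The "only after healthy" condition can be enforced in a script instead of checked by eye. The following is a minimal sketch with hypothetical function names; it deregisters the old port only when the candidate port reports healthy.

```shell
TG_ARN=arn:aws:elasticloadbalancing:us-east-2:018772930825:targetgroup/birchbackend-tg/7699030c55ef2088
INSTANCE=i-00f8b6f49c1e4edac

# Print the health state (healthy, initial, unhealthy, ...) of port $1.
target_state() {
  aws elbv2 describe-target-health \
    --region us-east-2 \
    --target-group-arn "$TG_ARN" \
    --targets "Id=$INSTANCE,Port=$1" \
    --query 'TargetHealthDescriptions[0].TargetHealth.State' \
    --output text
}

# Deregister old port ($2) only if new port ($1) is currently healthy.
cut_over() {
  new_port="$1"
  old_port="$2"
  state=$(target_state "$new_port") || return 1
  if [ "$state" != "healthy" ]; then
    echo "candidate on port $new_port is '$state'; not cutting over" >&2
    return 1
  fi
  aws elbv2 deregister-targets \
    --region us-east-2 \
    --target-group-arn "$TG_ARN" \
    --targets "Id=$INSTANCE,Port=$old_port"
}

# Example: cut_over 5056 5055
```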
8. Verify public production
curl -fsS https://birchbackend.ihrailsoftware.com/get_all_common_data >/tmp/common.json
Document API checks:
No auth → 401
Valid auth → 200 document response
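The unauthenticated checks can be scripted so the result is a clear pass or fail. This sketch covers only the health route and the no-auth case; the helper names are illustrative.

```shell
BASE=https://birchbackend.ihrailsoftware.com

# Print the HTTP status for a request; arguments are passed to curl.
http_status() {
  curl -sS -o /dev/null -w '%{http_code}' "$@"
}

# Compare an observed status ($3) with the expected one ($2).
expect_status() {
  if [ "$3" = "$2" ]; then
    echo "PASS $1 ($3)"
  else
    echo "FAIL $1 (expected $2, got $3)" >&2
    return 1
  fi
}

# Check the health route (200) and the document API without auth (401).
verify_public() {
  health=$(http_status "$BASE/get_all_common_data")
  noauth=$(http_status -X POST \
    -H 'content-type: application/json' \
    --data '{"work_order":"WO_05926","document_type":"arr-500b","bill_to":"owner"}' \
    "$BASE/agent/workorder-documents")
  expect_status "health route" 200 "$health" &&
    expect_status "document API without auth" 401 "$noauth"
}
```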
Rollback procedure
If the new target on 5056 fails after cutover:
- Re-register the old target:
aws elbv2 register-targets \
--region us-east-2 \
--target-group-arn arn:aws:elasticloadbalancing:us-east-2:018772930825:targetgroup/birchbackend-tg/7699030c55ef2088 \
--targets Id=i-00f8b6f49c1e4edac,Port=5055
- Wait for i-00f8b6f49c1e4edac:5055 to become healthy.
- Deregister the new target:
aws elbv2 deregister-targets \
--region us-east-2 \
--target-group-arn arn:aws:elasticloadbalancing:us-east-2:018772930825:targetgroup/birchbackend-tg/7699030c55ef2088 \
--targets Id=i-00f8b6f49c1e4edac,Port=5056
- Keep both containers until traffic is stable and logs are reviewed.
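The rollback steps can be chained so the new target is removed only after the old one is back in service. This is a sketch with a hypothetical function name. It relies on the elbv2 target-in-service waiter, which gives up after a bounded number of polls; with a 300-second health check interval the waiter may time out before the old target turns healthy, in which case the function stops and leaves 5056 registered.

```shell
TG_ARN=arn:aws:elasticloadbalancing:us-east-2:018772930825:targetgroup/birchbackend-tg/7699030c55ef2088
INSTANCE=i-00f8b6f49c1e4edac

rollback() {
  # Step 1: put the old target back in the target group.
  aws elbv2 register-targets --region us-east-2 \
    --target-group-arn "$TG_ARN" --targets "Id=$INSTANCE,Port=5055" || return 1
  # Step 2: wait until the old target passes health checks.
  aws elbv2 wait target-in-service --region us-east-2 \
    --target-group-arn "$TG_ARN" --targets "Id=$INSTANCE,Port=5055" || {
    echo "old target not healthy yet; leaving 5056 registered" >&2
    return 1
  }
  # Step 3: only now remove the new target.
  aws elbv2 deregister-targets --region us-east-2 \
    --target-group-arn "$TG_ARN" --targets "Id=$INSTANCE,Port=5056"
}
```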
Cleanup / follow-up recommendations
- Update GitHub Actions deploy workflow to target the real ALB/instance path:
  - target i-00f8b6f49c1e4edac unless architecture changes
  - inject BIRCH_AGENT_DOCUMENT_API_KEY from Parameter Store
  - do not assume AWS CLI exists on the instance, or install/manage it explicitly
  - use SSM from GitHub with known-good AWS credentials
- Decide whether production should standardize on ECR or Docker Hub:
  - Current new deployment uses ECR.
  - Old production used Docker Hub.
  - Avoid split-brain image sources.
- Clean up stale/unhealthy ALB target if confirmed unused:
  i-024cdbf3b2bcc7d66:5055
- Decide when to stop/remove rollback container:
  birch_backend on port 5055
  Do not remove it until Derek or the responsible operator approves.
- Review security group exposure on port 5056:
  - narrow ALB-source rules were added
  - a broader 0.0.0.0/0 rule for 5056 already existed on another attached SG
  - consider removing the broad rule if not required elsewhere
- Rename the production container once stable, if desired:
  - current name birch_backend_candidate is accurate for the rollout but awkward for long-term operations
  - if renaming, use blue/green procedure again to avoid downtime
Related docs
Agent/API usage instructions:
docs/agent-document-api-agent-instructions.md
API reference:
docs/agent-document-api.md