Solving Chat Message Loss in a Multi-Instance Environment

Posted Oct 20, 2025 Updated Jan 10, 2026

By Junyoung Yang

2 min read

Problem Discovery

For the Kakao Tech Campus final project, I implemented team chat with WebSocket. It worked fine locally.

Then I deployed to AWS ECS, and once auto-scaling kicked in with 2+ instances, some messages never reached other users. The cause:

WebSocket connections are bound to a specific instance.

Client A → Connected to Instance 1
Client B → Connected to Instance 2

When Client A sends a message, Instance 1 processes it, but Client B is connected to Instance 2. Instance 1 has no idea where Client B is, so the message never reaches Client B.
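To make the failure concrete, here is a minimal in-memory sketch of it. It is not the project's code: the `Instance` class and the client names are made up for illustration, and a plain list stands in for each WebSocket session.

```python
class Instance:
    """One server instance that only knows about its own WebSocket sessions."""

    def __init__(self, name):
        self.name = name
        # client id -> list of delivered messages (stand-in for a real socket)
        self.sessions = {}

    def connect(self, client_id):
        self.sessions[client_id] = []

    def handle_send(self, sender_id, text):
        # Broadcast only to sessions registered on THIS instance.
        for inbox in self.sessions.values():
            inbox.append((sender_id, text))


instance_1 = Instance("instance-1")
instance_2 = Instance("instance-2")
instance_1.connect("client-a")
instance_2.connect("client-b")

instance_1.handle_send("client-a", "hello")

print(instance_1.sessions["client-a"])  # [('client-a', 'hello')]
print(instance_2.sessions["client-b"])  # []
```

Client B's inbox stays empty because the session registry lives in the memory of one process, and nothing tells Instance 2 that a message arrived.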

I needed to sync messages across instances, so I looked at a few options.

Since chat messages were also being stored in the DB, Pub/Sub’s lack of persistence wasn’t an issue. I went with Redis Pub/Sub.
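The idea can be sketched roughly like this. This is an assumption-heavy illustration, not the actual implementation: `InMemoryBroker` is a hypothetical stand-in for Redis Pub/Sub so the example runs without a Redis server, and `ChatInstance`, the `"chat"` channel name, and the list used as a database are all invented here.

```python
from collections import defaultdict


class InMemoryBroker:
    """Stand-in for Redis Pub/Sub: fire-and-forget, nothing is stored."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # channel -> callbacks

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        # Deliver to whoever is subscribed right now; return the receiver count.
        receivers = list(self.subscribers[channel])
        for callback in receivers:
            callback(message)
        return len(receivers)


class ChatInstance:
    """Server instance that relays every message through the shared broker."""

    def __init__(self, name, broker, channel="chat"):
        self.name = name
        self.broker = broker
        self.channel = channel
        self.sessions = {}  # client id -> delivered messages
        broker.subscribe(channel, self.deliver_locally)

    def connect(self, client_id):
        self.sessions[client_id] = []

    def handle_send(self, sender_id, text, database):
        # Persist first, so the broker's lack of persistence does not matter.
        database.append((sender_id, text))
        self.broker.publish(self.channel, (sender_id, text))

    def deliver_locally(self, message):
        # Every instance, including the sender's, pushes to its own sessions.
        for inbox in self.sessions.values():
            inbox.append(message)


broker = InMemoryBroker()
database = []
one = ChatInstance("instance-1", broker)
two = ChatInstance("instance-2", broker)
one.connect("client-a")
two.connect("client-b")

one.handle_send("client-a", "hello", database)
print(two.sessions["client-b"])  # [('client-a', 'hello')]
```

The sending instance no longer delivers directly. It publishes, and each instance (itself included) forwards what it hears to its own local sessions, so no instance needs to know where any client is connected.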

To share this decision with the team, I wrote an ADR (Architecture Decision Record).

Key Decisions:

Risks:

Implementation alone wasn’t enough—I needed to verify that messages weren’t lost during actual auto-scaling.

On ALB + Fargate + ECS, I simulated 100 concurrent chat users with JMeter.

Normal state: 100% message consistency with 2 instances
Scale out: Traffic spike → scaled to 3 instances → new instance received messages correctly
Scale in: Traffic drop → reduced instances → remaining instances worked fine
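One way to put a number on "message consistency" is to compare the message IDs that were sent with the IDs each client received. The sketch below assumes that definition. The `consistency_ratio` function and the sample data are hypothetical and are not the measurements from the test above.

```python
def consistency_ratio(sent_ids, received_by_client):
    """Fraction of expected (client, message) deliveries that actually happened."""
    expected = set(sent_ids)
    if not expected or not received_by_client:
        return 1.0  # nothing was expected, so nothing was lost
    delivered = sum(
        len(expected & set(received)) for received in received_by_client.values()
    )
    return delivered / (len(expected) * len(received_by_client))


# Made-up sample: client-b missed message m3.
sent = ["m1", "m2", "m3", "m4"]
received = {
    "client-a": ["m1", "m2", "m3", "m4"],
    "client-b": ["m1", "m2", "m4"],
}

ratio = consistency_ratio(sent, received)
print(f"{ratio:.1%}")  # 87.5%
```

A result of 100% means every client saw every message. Anything lower points at a delivery gap worth tracing back to a specific instance or scaling event.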

Code that works on a single instance isn’t guaranteed to work in a distributed environment
“It works” and “it works reliably” are different—verify under load
Architecture decisions need justification: document them in an ADR so you can explain them later

From Kakao Tech Campus 3rd cohort final project (student schedule management service).

This post is licensed under CC BY 4.0 by the author.