Reddit Experience · Apr 2026

Docker Swarm Broke Production and I Can't Fix It

3 upvotes 5 replies



Full Details

I broke production with a single one-line command. I accidentally ran Docker Swarm to build and deploy the whole app instead of the specific service I wanted to update. I didn't notice until about 30 seconds into the deployment, when my SSH session timed out, which I take to mean the server was overwhelmed (memory exhaustion) and shut down. So even though the deployment never completed, I believe it was the root cause. Even after a reboot or restart, the server keeps trying to build the whole app.

I do not have access to the server console, so I can't reboot the instance or detach and roll back my volume. All I could do was give context on what I ran and what might have broken the server. But for some reason, the team that manages all the servers decided to terminate the instance, which wiped out our entire database and all our files. Before that, I had suggested they send me the error logs so I could tell whether my Docker Compose setup was at fault (if so, I wanted to prevent a repeat by adding more rules and instructions). I had also suggested changing how the instance starts up, by manually telling it not to build the image or by dropping some services on start.

So now we have to redeploy the whole infrastructure, and the higher-ups are really mad at me for causing this. My supervisor tried to take the blame, but I feel really guilty about it. For context, the development team is basically just me and my supervisor, who is a non-technical person. I mostly write code and have rarely done deployments. The whole tech stack and deployment used to be handled by another person. It has been more than 24 hours, and we have yet to redeploy the infrastructure. I can't do anything now except mentally write my resignation.

On a side note, is there a way to prevent Docker or the instance from even starting the build if it estimates that resources will be insufficient?
I am really new to all this :(
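
To make the question concrete, here is a rough sketch of the kind of guard I have in mind: a pre-flight check that reads the available memory on a Linux host and refuses to continue if it is below a threshold. This is only an assumption about how such a check could look, not something Docker provides. The function names and the 2048 MiB threshold are hypothetical, and the script cannot estimate how much memory a build really needs, so the number would have to come from experience.

```python
import sys


def mem_available_mib(meminfo_text: str) -> int:
    """Return the MemAvailable value from /proc/meminfo-style text, in MiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:    3145728 kB"
            return int(line.split()[1]) // 1024
    raise ValueError("MemAvailable not found in meminfo text")


def enough_memory(meminfo_text: str, required_mib: int) -> bool:
    """True if the available memory meets the (hypothetical) requirement."""
    return mem_available_mib(meminfo_text) >= required_mib


def main(required_mib: int = 2048) -> int:
    """Return 0 if the host looks safe to build on, 1 otherwise (Linux only)."""
    with open("/proc/meminfo") as f:
        text = f.read()
    if not enough_memory(text, required_mib):
        print(
            f"Refusing to build: less than {required_mib} MiB available",
            file=sys.stderr,
        )
        return 1
    return 0
```

Saved as a hypothetical `preflight.py` with a final line `raise SystemExit(main())`, it could be chained in front of the real command, for example `python3 preflight.py && <build or deploy command>`, so that the build never starts when the check fails.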

