I love debugging. You know the kind where you can't, for the love of god, figure out why something is happening the way it is, or why it's happening in a way it shouldn't? Exactly, those!
So here is a series of posts, in a "story" format, about my debugging adventures.
If you like reading Twitter threads, here is the whole post on Twitter.
Story: A Tale of Memory Limits and Technical Resilience
Our SaaS has a core flow where data is created via CSV uploads. The business logic in this flow is huge and takes a lot to process, so obviously we do it in a background job.
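For context, the flow looks roughly like this (a minimal sketch with made-up names; the real job carries far more business logic per row):

```ruby
require "sidekiq"
require "csv"

# Minimal sketch of the upload flow. CsvImportJob and process_row are
# made-up names for illustration only.
class CsvImportJob
  include Sidekiq::Job

  def perform(tenant_id, csv_path)
    CSV.foreach(csv_path, headers: true) do |row|
      process_row(tenant_id, row)
    end
  end

  private

  # Placeholder for the heavy per-row logic: validations, lookups,
  # record creation, etc.
  def process_row(tenant_id, row)
  end
end
```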
Support reported that any CSV bigger than ~200 lines was failing, every time, on a specific tenant. We have partially moved to K8s, but some tenants still run on an old setup (EC2 with custom management: Salt Stack, Supervisor, ...), and this tenant was one of them.
Now, as always, we opened up the New Relic and Sidekiq (our background processor) dashboards and started looking for clues. Nil. Nothing was getting logged anywhere... Let's keep looking!
We tried replicating the issue on prod and soon found that the Sidekiq process was getting killed midway. Okay, we finally have something to go on.
The first guess was that it's a memory or CPU issue; Ruby is infamous for its memory consumption, right? But wait, the New Relic graphs were normal: 20% memory and 40% CPU usage. Hmmm... that's odd.
Random questions started popping up: does Sidekiq have a memory limit per process/thread? Is there some kind of sidekiq_killer script running in the background?
The answer to both was: NO.
We started looking at the OOM (out-of-memory) killer activity logs and found the same PIDs there (woohoo!). Okay, so we know for sure it's a memory thing... but what!?
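If you've never gone hunting for these, OOM kills show up in the kernel ring buffer; a plain `dmesg -T | grep -i oom` does the job, or wrapped in Ruby to match the rest of this post:

```ruby
# Scan the kernel ring buffer for OOM-killer activity and the PIDs it
# reaped (typically needs root). Equivalent to running
# `dmesg -T | grep -iE "oom|killed process"` by hand.
oom_lines = `dmesg -T`.lines.grep(/oom|killed process/i)
puts oom_lines
```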
The usual steps for this kind of issue are to profile the code => check flame graphs => find and fix whatever is causing the memory bloat.
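For reference, the profiling step tends to look something like this (a sketch using the memory_profiler gem; the job and its arguments are the made-up ones from above, not our actual code):

```ruby
require "memory_profiler"

# Sketch of the profiling step. CsvImportJob and its arguments are
# placeholders for illustration.
tenant_id = 1
csv_path  = "tmp/big_upload.csv"

report = MemoryProfiler.report do
  CsvImportJob.new.perform(tenant_id, csv_path)
end

# Prints allocated/retained memory broken down by gem, file, and call site,
# which is usually enough to spot the biggest offenders.
report.pretty_print
```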
But as I mentioned, this was not an easy task due to the extensive logic within this workflow. We tried optimizing by disabling query caching to save memory; it helped a bit, but not a lot. CSVs would now fail after ~250 lines instead of ~200. The only way forward seemed to be refactoring this code and breaking the CSV upload process into smaller parts. That would take some time, so one dev started working on it immediately...
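The two mitigations looked roughly like this (a sketch assuming Rails/ActiveRecord; CsvChunkImportJob is a hypothetical name for the chunked job, not something that exists in our codebase):

```ruby
require "csv"

# 1. Disable the ActiveRecord query cache around the heavy section so cached
#    result sets don't pile up in memory for the whole run.
ActiveRecord::Base.uncached do
  CsvImportJob.new.perform(tenant_id, csv_path)
end

# 2. The refactor direction: split the upload into chunks and enqueue one
#    job per chunk instead of holding the whole file in a single process.
#    CsvChunkImportJob is a hypothetical name for that smaller job.
CSV.foreach(csv_path, headers: true).each_slice(100) do |rows|
  CsvChunkImportJob.perform_async(tenant_id, rows.map(&:to_h))
end
```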
Something still kept bugging us: why was the process getting killed if total memory usage was low? So instead of relying on New Relic, we started looking into the VM... again.
Reading through the kernel messages, we saw a line like "Memory cgroup stats for /mem_limit". After a lot of back and forth with GPT, we finally found where the mem_limit was being set for this cgroup. Upon checking, it turned out the limit was set way too low: ~2.8GB, while the machine has 32GB. That's crazy!
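For the curious: on a cgroup v1 machine the limit is just a file under /sys/fs/cgroup, so comparing it against total RAM takes a minute. A sketch (the cgroup name is taken from the log line above; adjust the path for your setup):

```ruby
# Compare a cgroup v1 memory limit against total RAM.
# The cgroup name ("mem_limit") comes from the kernel log line above;
# on cgroup v2 the file is memory.max instead of memory.limit_in_bytes.
GIB = 1024.0**3

limit_bytes = File.read("/sys/fs/cgroup/memory/mem_limit/memory.limit_in_bytes").to_i
total_bytes = File.read("/proc/meminfo")[/MemTotal:\s+(\d+)/, 1].to_i * 1024

puts format("cgroup limit: %.1f GiB", limit_bytes / GIB)
puts format("machine RAM:  %.1f GiB", total_bytes / GIB)
```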
We increased the mem_limit for this cgroup to 16GB and voila! Things started working properly! But how did this happen in the first place!?
Turns out we have a custom script that sets all the limits and whatnot. One of the things it's supposed to do is re-tune the cgroup mem_limits whenever the machine is scaled up or down!
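Roughly, on every scale event the script is supposed to do something like this (a simplified sketch, not the actual script; the cgroup path and the 50% split are assumptions):

```ruby
# Simplified sketch of what the tuning script should do on every scale
# event: recompute the cgroup limit from current RAM instead of leaving a
# stale absolute value behind. Path and fraction are assumptions.
CGROUP_LIMIT_FILE = "/sys/fs/cgroup/memory/mem_limit/memory.limit_in_bytes"
FRACTION_OF_RAM   = 0.5

total_bytes = File.read("/proc/meminfo")[/MemTotal:\s+(\d+)/, 1].to_i * 1024
new_limit   = (total_bytes * FRACTION_OF_RAM).to_i

File.write(CGROUP_LIMIT_FILE, new_limit.to_s) # needs root
puts "mem_limit set to #{new_limit / (1024**3)} GiB"
```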
We had last scaled up this machine in May, but sadly the limits had been left untouched since then! Recap:
- Debugging a CSV upload
- To profiling code
- To disabling query caching
- To refactoring the flow
- Finally landed on the cgroup memory_limit fix, which worked!
Real-life debugging is pretty messy. The more situations you have to debug, the better you become.