Thank you for working so hard to fix this issue; we really appreciate it
It’s all the ghosts we haven’t slayed! They’re haunting the webcode.
It feels like server overload to me. If y’all haven’t changed anything, that’s a possibility. Have you checked any load balancing or server utilization stats? I’ve been getting these errors for weeks now, although admittedly it’s gotten worse in the last few days.
They recently had that update overhauling decks, so maybe it’s related to that?
Didn’t want to be negative all the time, so just letting you know that it’s working perfectly today! Didn’t have one issue! Good work!
We managed to track down what the issue was. There haven’t been any periods of errors since we made the changes yesterday, so everything should not only be back to normal but should actually perform better than before. We apologize to everyone for the inconvenience this caused!
Details about the issue for those curious
Out of the blue, we suddenly had periods of high errors, but no slow response times. We hadn’t made any changes within the 48 hours or so before that, and the errors were intermittent.
Our troubleshooting journey:
Step 1: Check the logs. Our logs didn’t give us much help, just “trouble connecting to the DB”.
Step 2: Google. But Stack Overflow wasn’t actually helpful, as most of it was “did you actually create and set up your database?” (of course we did, it was working just yesterday) or “just dump the db and rebuild”.
Step 3: Turn it off and back on again. Because, as any good developer knows, just trying this can sometimes solve the problem. This actually did help, but the problem returned periodically during the day.
Step 4: Implement some DB improvements to remove unused indexes, etc. (there’s a rough sketch of that check right after this list).
Step 5: More Google. It seemed like it might be something on our DB provider’s end regarding shared buffer memory in their cloud services.
Step 6: Reach out to them. Their support was terribly slow, and their suggestions were things we had already tried.
Step 7: More Google. Figured out that increasing the RAM for the DB by upgrading to a higher tier might help.
Step 8: Might as well try, since nothing else was working. We 3.5x’d our RAM just to be safe.
Step 9: Wait.
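For the curious, here’s roughly what the Step 4 cleanup check looked like. This is a minimal sketch assuming a Postgres-style database and psycopg2; the connection string and read-only user are made up, so substitute your own.

```python
# Sketch of the unused-index check from Step 4 (assumes Postgres; the DSN is hypothetical).
import psycopg2

conn = psycopg2.connect("dbname=app user=app_readonly")  # hypothetical DSN

UNUSED_INDEXES = """
    SELECT schemaname, relname, indexrelname,
           pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
           idx_scan
    FROM pg_stat_user_indexes
    WHERE idx_scan = 0                   -- never used since the last stats reset
      AND indexrelname NOT LIKE '%pkey'  -- skip primary keys
    ORDER BY pg_relation_size(indexrelid) DESC;
"""

with conn, conn.cursor() as cur:
    cur.execute(UNUSED_INDEXES)
    for schema, table, index, size, scans in cur.fetchall():
        print(f"{schema}.{table}: {index} ({size}, {scans} scans)")
```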
Since the upgrade, the error hasn’t occurred again.
In retrospect, despite only using a small portion of the actual DB storage (something like 25%), we do have tables with tens and hundreds of millions of rows. Combine that with the fact that we average 40-50 requests per second to the servers, and the DB had a lot of connections, couldn’t cache everything it wanted to, and had to fall back on swap. That basically tied up all of its memory and prevented new connections, hence the errors.
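If you want to sanity-check the same theory on your own setup, comparing the connection count to the connection limit and the buffer size tells you most of the story. A minimal sketch, again assuming a Postgres-style DB and a made-up DSN:

```python
# Rough check of "too many connections, too little memory" (assumes Postgres; DSN is hypothetical).
import psycopg2

conn = psycopg2.connect("dbname=app user=app_readonly")  # hypothetical DSN

CHECKS = {
    "active/idle connections": "SELECT count(*) FROM pg_stat_activity;",
    "connection limit":        "SHOW max_connections;",
    "shared buffers":          "SHOW shared_buffers;",
}

with conn, conn.cursor() as cur:
    for label, query in CHECKS.items():
        cur.execute(query)  # each query returns a single row
        print(f"{label}: {cur.fetchone()[0]}")
```

When the connection count sits near the limit while the shared buffers are small relative to your hot tables, you get exactly the “can’t cache, can’t accept new connections” behavior we saw.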
Looking at key metrics from before and after, there is definitely improvement not only in memory usage but also in response time and cache hit ratio.
This isn’t an issue we had ever had before, or even thought we were close to having. All in all, it was a good learning experience, and we now know what metrics to keep an eye on and, more importantly, how to prevent the issue in the future.
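The main number we watch now is the cache hit ratio. Here’s a small sketch of how you might pull it, assuming Postgres statistics views (same hypothetical DSN as above); anything close to 1.0 means reads are being served from memory rather than disk.

```python
# Cache hit ratio across user tables (assumes Postgres; DSN is hypothetical).
import psycopg2

conn = psycopg2.connect("dbname=app user=app_readonly")  # hypothetical DSN

CACHE_HIT_RATIO = """
    SELECT sum(heap_blks_hit)::float
           / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0)
    FROM pg_statio_user_tables;
"""

with conn, conn.cursor() as cur:
    cur.execute(CACHE_HIT_RATIO)
    ratio = cur.fetchone()[0] or 0.0  # None if the stats views are empty
    print(f"cache hit ratio: {ratio:.3f}")
```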
Thank you for coming to my DB troubleshooting TED Talk.
Why is it always RAM?
Either way, I’m happy it’s so smooth now (I’m even having fewer connection errors on my work computer, which is a huge W).
Glad you all figured out the issue!
Pretty much
Thank you so much! I’m so glad it’s working smoother now, and so far so good.