What is going on with the platform today?

I’ve just had another connection error appear at the end of doing my reviews. After a couple of resubmits it eventually went through but the site seems a lot slower loading still.

1 Like

Got a connection error again while doing my reviews

1 Like

This has been happening to me for the last 3 days!

3 Likes

I am struggling once again. I can barely do one review and then I have to reload the page. It was better about 5 hours ago, but right now it’s really bad again.

2 Likes

Just wanted to let you know I am also experiencing issues; it keeps asking me to try reloading the page. Thankfully it doesn’t seem to be losing my reviews, but it is quite frustrating as it has been happening for a few days now.

1 Like

Yeah, seems to be happening still. I just submitted an answer and got a crazy ASCII face, too, which has never happened before.

1 Like

I’m also having issues. I had them on Monday and they seem to be happening again today.

1 Like

Also having issues tonight. I was having connection errors after almost every review. The whole site was also kinda slow and it took about 10 minutes to update the reviews I had done.

1 Like

Same here. Errors submitting, errors pulling new questions, my review session resetting to include only 1 grammar term, and strangely slow loading of my stats on the homepage.

1 Like

Same here, site also seems to be going offline periodically.

1 Like

Just letting you all know we are discussing the issues with our DB provider. I took the site offline into maintenance mode just a bit ago to implement a change. We will keep monitoring it and trying to figure out the issue.

Like I mentioned above it hasn’t been an easy issue to resolve, mainly because the errors we are getting aren’t something we have ever dealt with before and they seemingly started happening out of the blue without any changes on our end. Not to mention they just randomly pop up without any rhyme or reason.

11 Likes

Thank you for working so hard to fix this issue, we really appreciate it :slight_smile:

3 Likes

It’s all the ghosts we haven’t slayed! :sob: They’re haunting the webcode. :frowning_face_with_open_mouth:

7 Likes

It feels like server overload to me. If y’all haven’t changed anything, that’s a possibility. Have you checked any load balancing or server utilization stats? I’ve been getting these errors for weeks now, although admittedly it’s gotten worse in the last few days.

1 Like

They recently had that update overhauling decks, so maybe it’s related to that?

1 Like

Didn’t want to be negative all the time, so just letting you know that it’s working perfectly today! Didn’t have one issue! :slight_smile: Good work!

3 Likes

We managed to track down what the issue was. There haven’t been any periods of errors since we made the changes yesterday so everything should not only be back to normal but be better performance-wise than before. We apologize to everyone for the inconvenience it caused! :bowing_man:

Details about the issue for those curious

Out of the blue, we suddenly had periods of high error rates but no slow response times. We hadn’t made any changes in the 48 hours or so before that. The errors were also intermittent.

Our troubleshooting journey:

Step 1: Check the logs. Our logs didn’t give us much help, just “trouble connecting to the DB”.
Step 2: Google. But Stack Overflow wasn’t actually helpful, as most of it was “did you actually create and set up your database” (of course we did, it was working just yesterday) or “just dump the db and rebuild” (:upside_down_face: )
Step 3: Turn it off and back on again. Because as any good developer knows, just trying this can sometimes solve the problem. This actually did help, but the problem returned periodically during the day.
Step 4: Implement some DB improvements, such as removing unused indexes.
Step 5: More Google. Seems it might be something on our DB provider’s end regarding shared buffer memory in their cloud services.
Step 6: Reach out to them. Their support is terribly slow and their suggestions were things we had already tried.
Step 7: More Google :sob: Maybe figure out that if we increase the RAM for the DB by upgrading to a higher tier, it might help.
Step 8: Might as well try, nothing else is working. 3.5x our RAM just to be safe.
Step 9: Wait.

Since the upgrade, the error hasn’t occurred again.

In retrospect, despite only using a small portion of the actual DB storage (something like 25%), we do have tables with tens and hundreds of millions of rows. Combined with the fact that we get on average 40-50 requests per second to the servers, this meant the DB had a lot of connections and couldn’t cache everything it wanted to. It was having to use swap, which basically tied up all of its memory and prevented new connections, thus the errors.
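
For the curious, a rough back-of-envelope version of that reasoning can be sketched in a few lines of Python. This is only an illustration, not the tooling used during the incident: the 50 requests per second comes from the figures above, while the average request time, total RAM, and per-connection memory are made-up placeholder values, and both function names are hypothetical.

```python
def estimate_concurrent_connections(requests_per_second: float,
                                    avg_seconds_per_request: float) -> float:
    """Little's law: average in-flight requests = arrival rate * time per request."""
    return requests_per_second * avg_seconds_per_request


def memory_left_for_cache_mb(total_ram_mb: float,
                             connections: float,
                             per_connection_mb: float) -> float:
    """RAM remaining for caching once per-connection working memory is set aside."""
    return total_ram_mb - connections * per_connection_mb


if __name__ == "__main__":
    # 50 req/s is the upper figure from the post; every other number is hypothetical.
    in_flight = estimate_concurrent_connections(50, avg_seconds_per_request=0.2)
    left = memory_left_for_cache_mb(total_ram_mb=4096,
                                    connections=in_flight,
                                    per_connection_mb=10)
    print(f"~{in_flight:.0f} connections in flight, ~{left:.0f} MB left for cache")
```

The point is only that slower requests or more memory per connection shrink what is left for caching, which is when a database starts falling back to swap.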

Looking at key metrics from before and after, there is definitely improvement not only in memory usage but also in response time and cache hit ratio.
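
As a minimal sketch of what "cache hit ratio" means here: it is the share of block requests served from memory rather than from disk. The post does not name the database, so the counters below are an assumption; in PostgreSQL, for example, they would correspond to the `blks_hit` and `blks_read` columns of `pg_stat_database`. The function name and sample numbers are hypothetical.

```python
def cache_hit_ratio(blocks_hit: int, blocks_read: int) -> float:
    """Fraction of block requests served from cache instead of disk.

    blocks_hit:  requests satisfied from the in-memory cache
    blocks_read: requests that had to go to disk
    """
    total = blocks_hit + blocks_read
    if total == 0:
        # No activity recorded yet; report 0.0 rather than dividing by zero.
        return 0.0
    return blocks_hit / total


if __name__ == "__main__":
    # Hypothetical counter values, not measurements from the incident.
    print(f"hit ratio: {cache_hit_ratio(9_900_000, 100_000):.2%}")
```

A ratio that drops while memory usage climbs is the kind of signal worth alerting on before connections start failing.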

This isn’t an issue we have ever had or thought we were close to having. All in all, it was a good learning experience and we now know what metrics to keep an eye on and more importantly how to prevent the issue in the future.

Thank you for coming to our DB troubleshooting TED Talk.

21 Likes

Why is it always RAM? :weary:

Either way, I’m happy it’s so smooth now (I am even having fewer connection errors on my work computer, which is a huge W).

3 Likes

Glad you all figured out the issue!

3 Likes

Pretty much :joy:

10 Likes