Some VMware servers lost storage

Resolved

This will not be the final postmortem on this issue. I just want to share what is happening right now and how we are working with our storage vendor to reach a permanent fix.

The root cause is a bug in the storage controller software, which we have only recently been able to confirm. Separately from the VMware and storage platform, we have been having issues with our internal DNS resolver cluster. The two are linked by a chain reaction: when the resolver cluster performs badly or has an outage, the NFS services in our storage solution get "swamped" and use up all their threads.

That means the storage solution is no longer able to serve data.
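To illustrate the chain reaction described above, here is a minimal sketch of how slow lookups can exhaust a fixed pool of worker threads. It is a simplified model and not the vendor's implementation: the function name `simulate_thread_pool`, the tick-based timing, and every number used are assumptions chosen for illustration only.

```python
def simulate_thread_pool(num_threads, arrivals_per_tick, lookup_ticks, total_ticks):
    """Toy model: each request holds one worker thread for `lookup_ticks` ticks.

    Returns (served, waiting): the number of completed requests and the
    number still queued without a thread after `total_ticks` ticks.
    All parameters are hypothetical and not taken from the real system.
    """
    busy = []      # remaining ticks for each occupied thread
    waiting = 0    # requests queued because no thread is free
    served = 0     # requests that have completed

    for _ in range(total_ticks):
        # Advance in-flight requests by one tick; finished ones free their thread.
        busy = [remaining - 1 for remaining in busy]
        served += sum(1 for remaining in busy if remaining == 0)
        busy = [remaining for remaining in busy if remaining > 0]

        # New requests arrive and join the queue.
        waiting += arrivals_per_tick

        # Hand queued requests to whatever threads are free.
        started = min(num_threads - len(busy), waiting)
        busy.extend([lookup_ticks] * started)
        waiting -= started

    return served, waiting


if __name__ == "__main__":
    # Fast lookups: threads are released quickly and the queue stays empty.
    print("healthy: ", simulate_thread_pool(8, 2, 1, 100))
    # Slow lookups: every thread is stuck waiting and the queue keeps growing.
    print("degraded:", simulate_thread_pool(8, 2, 50, 100))
```

In this model the request rate is the same in both runs. Only the time each request holds a thread changes, and that alone is enough to leave the pool with no free threads.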

Going forward we are working on two fronts:

  • Ensuring the reliability of the resolver cluster. We have a new design and are working to implement it.
  • Making a plan together with the storage vendor to get the correct patch and fix for the DNS bug in place. I hope to have more information on this during the afternoon. I am making sure we have our vendor's highest attention; this issue is not acceptable.

Above all, we are working our hardest to make sure that our services are running smoothly, and we are aware of the big impact this has on our customers.

I will get back to you as soon as possible with more information. In the meantime, if you have any questions, please send them to support@glesys.se and we will answer right away.

Again, we are fully committed to making things right.

Edit 5th of April @ 19:13

This is the plan going forward:

  • The storage vendor is working on a patch that addresses this issue. It is expected to be ready for QA by the end of this week. How long QA takes will depend on how big the patch is.

The patch will be ready for deployment in our environment on the 18th of April at the latest, but we will schedule a maintenance window as soon as the patch is ready, even if that is before the 18th.

About 70% of the vendor's customers run their environment in the same manner as we do, but only a small subset of them are experiencing the same issues. This makes it a bit of a corner case and very hard to troubleshoot.

Regarding the DNS cluster, we have made some changes intended to keep it available. We hope that this, in combination with a workaround that we have implemented in the VMware platform, will keep the system stable.

We will be fully transparent about why this could happen. At the moment we are collecting information from our own systems and from the vendor so that we can be as thorough as possible in the coming report.

Tomorrow we will receive more information on when the patch will be ready. I will update this post as soon as we know.

-Andreas B, Head of the VMware platform.

Resolved

No reports of any more problems, closing incident.

Monitoring

We have identified and solved the problem for now. We are closely monitoring the systems and are working with our storage vendor to implement a permanent solution.

Identified

We've identified that some virtual machines in our VMware cluster have lost the connection to our storage solution and need to be restarted.

We're currently going through the affected servers, restarting the ones with lost storage and making sure that they start up correctly. If you're experiencing problems related to this, feel free to email support@glesys.se and we will help you.

Began at: