For the past week, we’ve had a pesky routing problem in the CTO Advisor Hybrid Infrastructure (CTOAHI) data center. We finally had time to look at it, and fixing it required a disruptive network change. We had prepped the environment over the past few weeks as we migrated all of our storage to our shiny new HPE Alletra storage array. Finally, we felt confident, and that was the point at which it all went wrong.
Before the HPE array, we ran all our virtual machines (VMs) on VMware vSAN. You don’t need to know much about vSAN; just know that losing all network connectivity to a vSAN cluster is bad. Moving the VMs onto a dedicated storage array was our mitigation.
So, we went ahead with the network change, and all of the VMs lost connectivity to the storage array. That was an expected risk. What wasn’t expected? DNS ran on a VM that lived on that same storage array. Our playbook covers this: connect to the vSphere host running the DNS VM and manually boot it.
What we didn’t realize was that the new storage had created a new dependency on DNS. The vSphere hosts required DNS resolution to connect to the storage array that held the very DNS VM we needed to boot… A Veeam restore later, I had the DNS VM back up, but the damage was done.
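The trap here is a dependency cycle, and cycles are exactly the kind of thing a pre-change check can catch mechanically. Here’s a minimal sketch in Python of walking a service-dependency map and reporting the first cycle found. The service names and the dependency map are illustrative assumptions, not our actual inventory:

```python
# Hypothetical sketch: detect circular dependencies among infrastructure
# services before a disruptive change. The graph below is an illustration
# of our outage, not a real inventory export.

def find_cycle(deps):
    """Return a dependency cycle as a list of service names, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / in progress / done
    color = {s: WHITE for s in deps}

    def dfs(node, path):
        color[node] = GRAY
        path.append(node)
        for dep in deps.get(node, []):
            if color.get(dep, WHITE) == GRAY:
                # dep is on the current path: we found a cycle
                return path[path.index(dep):] + [dep]
            if color.get(dep, WHITE) == WHITE:
                color[dep] = WHITE
                cycle = dfs(dep, path)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for service in list(deps):
        if color[service] == WHITE:
            cycle = dfs(service, [])
            if cycle:
                return cycle
    return None

# Our post-migration situation: booting the DNS VM needs the storage
# array, and mounting the storage array needs DNS resolution.
deps = {
    "dns-vm": ["alletra-array"],       # DNS VM's disks live on the array
    "alletra-array": ["dns-vm"],       # hosts mount the array by hostname
    "vsphere-host": ["alletra-array"],
}
print(find_cycle(deps))  # ['dns-vm', 'alletra-array', 'dns-vm']
```

Feeding a map like this into a check during change planning would have flagged the DNS-on-storage loop before we ever touched the network.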
We are deploying a physical DNS server as redundancy to avoid this problem in the future. The harder lesson, though, is to reevaluate dependencies whenever you migrate critical systems. A tabletop exercise might have caught this one before it cost us an outage.