You’ve used Lambda, S3, and RDS to build a successful application on AWS. However, as your application grows, you may start to run into problems related to load and scaling. Is the problem an underperforming service? Are you in the wrong AWS region? Is your application design not up to the task? These issues can be challenging to navigate and may leave you wondering what your next move should be.
In this video, I discuss some solutions that developers and development teams can consider when facing these types of problems. One potential solution that I suggest is hiring a Site Reliability Engineer (SRE). An SRE is a specialized engineer responsible for ensuring that an application is highly available, reliable, and performing well. SREs work closely with developers and operations teams to identify and resolve issues related to load and scaling. Google popularized the idea of the SRE, and you can find substantial research on the practice.
What if you could apply AWS’s years of experience developing and troubleshooting applications to your own application infrastructure and operations? What if an AWS SRE could look at your CloudWatch data and alerts and direct your developers to trouble spots in your code or your AWS VPC? How much would you pay for that expertise? While that level of professional services may be cost-prohibitive, AWS offers an alternative in Amazon DevOps Guru.
DevOps Guru is an ML-driven assistant that automates many of the tasks an SRE undertakes. It scans AWS operational data, such as CloudWatch metrics and logs, for known inefficiencies and other issues, then prioritizes its findings and alerts on them. I haven’t used the service. However, I have to imagine that, like most AI, it augments human capability rather than replacing it. For a small operations team, DevOps Guru may mitigate the need to hire a full-time SRE. It doesn’t excuse an operations team from doing much of the hygiene required to ensure smooth operations.
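To make the "prioritizes and alerts" idea a little more concrete, here is a minimal sketch of how a team might pull ongoing findings and rank them by severity. Since I haven’t run this against a live account, treat it as an illustration under assumptions rather than a recipe: the helper name `ongoing_insights_by_severity` and the `SEVERITY_ORDER` ranking are my own, and the `client` argument is assumed to behave like the boto3 `devops-guru` client, whose `list_insights` call returns a `ReactiveInsights` list. The sketch reads only the first page of results.

```python
# Hypothetical ranking for illustration: lower number means more urgent.
SEVERITY_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


def ongoing_insights_by_severity(client):
    """Return (severity, name) pairs for ongoing reactive insights, most severe first.

    `client` is assumed to behave like boto3.client("devops-guru").
    Only the first page of results is read; pagination is ignored here.
    """
    response = client.list_insights(
        StatusFilter={"Ongoing": {"Type": "REACTIVE"}}
    )
    insights = response.get("ReactiveInsights", [])

    # Insights with a missing or unrecognized severity sort last.
    ranked = sorted(
        insights,
        key=lambda insight: SEVERITY_ORDER.get(
            insight.get("Severity"), len(SEVERITY_ORDER)
        ),
    )
    return [(insight.get("Severity"), insight.get("Name")) for insight in ranked]
```

With boto3 installed and credentials configured, you would presumably call it as `ongoing_insights_by_severity(boto3.client("devops-guru"))` and feed the result into whatever paging or ticketing process your team already uses.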