
Automation And Chaos Engineering Make People Happier | Michael Cucchi, PagerDuty


Guest: Michael Cucchi (LinkedIn)
Company: PagerDuty (LinkedIn, Twitter)
Show: Let’s Talk

PagerDuty is a cloud-based service for managing critical, urgent work and, as Michael Cucchi, Vice President of Product and Partner Marketing, puts it, “is known around the world as the world’s most efficient and rapid system for mobilizing people to solve problems.”

Automation in the cloud-native world is hard to pin down because the term is so broad, and Cucchi says it’s important to “define it down.” He describes it this way: “The steps that a human being might have to take to solve a problem, logging into a cluster and a set of services, running commands on those servers, to figure out exactly the status of resources or how they’re serving the end digital service, and then obviously, if you identify a problem, being able to resolve it, without a human being and also without a super user, without a highly skilled person.”
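To make that definition concrete, here is a minimal sketch of the diagnose-then-resolve loop Cucchi describes, written in Python. It is an illustration only, not PagerDuty's product or API: the `Service` class, `diagnose`, `restart`, and `run_runbook` are hypothetical names, and the "services" are simulated in memory rather than queried on real servers.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Service:
    """Hypothetical stand-in for a service whose status a human would check."""
    name: str
    healthy: bool = True
    restarts: int = 0


def diagnose(services: List[Service]) -> List[str]:
    """Return the names of services that currently report unhealthy."""
    return [s.name for s in services if not s.healthy]


def restart(service: Service) -> None:
    """Simulated remediation: count the restart and mark the service healthy."""
    service.restarts += 1
    service.healthy = True


def run_runbook(
    services: List[Service],
    remediate: Callable[[Service], None] = restart,
) -> List[str]:
    """Check status, remediate what is broken, and escalate whatever remains."""
    log: List[str] = []
    by_name = {s.name: s for s in services}
    for name in diagnose(services):
        remediate(by_name[name])
        log.append(f"ran remediation on {name}")
    # Anything still failing after remediation goes to a human.
    still_failing = diagnose(services)
    if still_failing:
        log.append(f"escalate to on-call: {', '.join(still_failing)}")
    return log
```

In this sketch a person is only pulled in when the automated step does not fix the problem, which matches the idea of resolving routine issues “without a human being and also without a super user.”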

PagerDuty recently announced a new product called Dynamic Service Graph, which provides a real-time view of a complex environment. It gives teams a clear picture of what is powering the business and makes it easier to keep those services running healthily.

Cucchi touches on the role chaos engineers play in this world when he says, “At PagerDuty, we do something called Failure Fridays, where we actually run simulated incidents and we practice, and then we do post mortems on how we behaved and what steps were taken.” He continues, “Were all the steps needed? Were we leveraging the technology to its fullest? Had we over-stressed certain individuals in the organization?” Cucchi concludes, “And so, assuming chaos can do two things, makes you resilient, makes you build better systems, but it actually makes you also build better practices, which makes your humans happier.”

Along those same lines, Cucchi brings up the importance of incident response when he says, “Whatever you can do to assume an incident, when you’re designing the code, the better. And assume that something’s going to go wrong, and at some point, a human’s going to have to get involved.”

The summary of the show is written by Jack Wallen.

Here are some of the topics we covered in this show:

  • Intro to PagerDuty
  • What is the importance of automation in the cloud-native world which is becoming overly complicated?
  • What announcement did they recently make?
  • PagerDuty focuses a lot on the cultural or the human aspect of any technology. What role do practices like Chaos Engineering play in this world?
  • What kind of awareness is there about practices like these, especially when it comes to the Incident Response Plan?