🧵 Thread (10 tweets)

Culture is a real thing in the world, and in a healthy engineering culture, someone not involved in immediate firefighting would take that engineer aside within minutes and say “You are probably worried for your job right now. Don’t be. Let me explain incident postmortems.” https://t.co/oDVcTjNkUK

I have had the honor of delivering this lecture before, because it seemed like a useful thing to do when I had no ability to meaningfully accelerate our path to recovery. And goodness knows I have plenty of anecdotes about breaking things to share with young engineers.

There is at least one engineer whom I helped get hired at a particular company, a major career upgrade for him. He perceived himself as being at fault for an issue which seemed like a Big Deal at the time, and got in touch with me to apologize, because he assumed I was toast.

(There are, indeed, some places in the world where one is perceived to vouch for the people one hires, and their failures are thus your failures. If you have long lived in one of those cultures, worrying that your mistake was a career-ender for me sounds natural.)

And so I told him that he could absolutely put out of his mind any worry for me, because the culture that is the professional class in America would certainly not chain responsibility in that fashion. But more importantly, he shouldn’t be worried for himself either.

“I never would have hired you into a shop that would countenance [firing an engineer for a systems issue].” might be an almost verbatim quote from that discussion. Last I checked, he was still a staff engineer there, some years after an incident that few remember.

Oh, making explicit something that was implicit: there exist companies with a blameless postmortem culture whose senior engineers would not take time out of their day to console a terrified newbie. And there exist ones with blameless postmortems where seniors would.

Apropos of nothing: a positive change Rails made many years into its existence is refusing to reset the database if you point it at a production environment. Culture isn't just a firm-level thing. We, as an industry, have tolerated footguns far, far too long.

There is some blend of machismo (“Well, I’m certainly not incompetent, so I don’t need guardrails to prevent me from wiping prod”) and a bit of hazing culture all mixed up in this. How about… no? How about “all tools should make it almost impossible to delete prod, out of the box.”
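To make “out of the box” concrete, here is a minimal sketch of the kind of guardrail the Rails change embodies. This is not the Rails API: the database is modeled as a plain dict, and the names `reset_database`, `check_destructive_allowed`, `PROTECTED_ENVIRONMENTS`, and the override variable `I_REALLY_MEAN_TO_DESTROY_PROD` are all hypothetical. The key assumption is that the target database records which environment it belongs to, so the tool checks what it is *pointed at*, not what it *thinks* it is running in.

```python
import os
from typing import Mapping, Optional


class ProtectedEnvironmentError(RuntimeError):
    """Raised when a destructive operation targets a protected database."""


# Hypothetical: environment names that destructive tools refuse to touch by default.
PROTECTED_ENVIRONMENTS = frozenset({"production"})

# Hypothetical override variable: wiping prod should require a deliberate, loud opt-in.
OVERRIDE_VAR = "I_REALLY_MEAN_TO_DESTROY_PROD"


def check_destructive_allowed(stored_env: str, env: Optional[Mapping[str, str]] = None) -> None:
    """Refuse to proceed if the database itself says it is a protected environment.

    `stored_env` is the environment recorded in the target database, not the
    environment of the machine running the tool.
    """
    env = os.environ if env is None else env
    if stored_env in PROTECTED_ENVIRONMENTS and env.get(OVERRIDE_VAR) != "1":
        raise ProtectedEnvironmentError(
            f"Refusing to reset a {stored_env!r} database. "
            f"Set {OVERRIDE_VAR}=1 if you are absolutely sure."
        )


def reset_database(db: dict, env: Optional[Mapping[str, str]] = None) -> None:
    """Hypothetical reset: drops all tables, but only after the guard passes."""
    check_destructive_allowed(db["metadata"]["environment"], env)
    db["tables"].clear()


if __name__ == "__main__":
    dev = {"metadata": {"environment": "development"}, "tables": {"users": [1, 2]}}
    reset_database(dev, env={})  # allowed: not a protected environment

    prod = {"metadata": {"environment": "production"}, "tables": {"users": [1, 2]}}
    try:
        reset_database(prod, env={})  # blocked by default
    except ProtectedEnvironmentError as err:
        print(err)
```

The design choice worth copying is the default: the safe path requires no effort, and the dangerous path requires the operator to type something they cannot type by accident.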

You write a tool that is capable of deleting prod, and then you slap your forehead: “Yep, I’m going to get some kid fired five years from now at a terrible employer. Phew, let me do you a solid, kiddo. Well, OK, not making a value judgement on whether your being there is a good idea.”