6mo ago

Delta CEO says CrowdStrike-Microsoft outage cost the airline $500 million

www.cnbc.com Delta CEO says CrowdStrike-Microsoft outage cost the airline $500 million

Bastian told CNBC's "Squawk Box" on Wednesday that the carrier would seek damages from the disruptions, adding, "We have no choice."

Delta Air Lines CEO Ed Bastian said the massive IT outage earlier this month that stranded thousands of customers will cost it $500 million.
The airline canceled more than 4,000 flights in the wake of the outage, which was caused by a botched CrowdStrike software update and took thousands of Microsoft systems around the world offline.
Bastian, speaking from Paris, told CNBC’s “Squawk Box” on Wednesday that the carrier would seek damages from the disruptions, adding, “We have no choice.”

You're viewing a single thread.

134 comments

Why do news outlets keep calling it a Microsoft outage? It's only a crowdstrike issue right? Microsoft doesn't have anything to do with it?
- It's sort of 90% of one and 10% of the other. Mostly the issue is a crowdstrike problem, but Microsoft really should have it so their their operating system doesn't continuously boot loop if a driver is failing. It should be able to detect that and shut down the affected driver. Of course equally the driver shouldn't be crashing just because it doesn't understand some code it's being fed.
  Also there is an argument to be made that Microsoft should have pushed back more at allowing crowdstrike to effectively bypass their kernel testing policies. Since obviously that negates the whole point of the tests.
  Of course both these issues also exist in Linux so it's not as if this is a Microsoft unique problem.
  
  There's a good 20% of blame belonging to the penny pinchers choosing to allow third-party security updates without testing environments because the corporation is too cheap for proper infrastructure and disaster recovery architecture.
  Like, imagine if there was a new airbag technology that promised to reduce car crashes. And so everyone stopped wearing seatbelts. And then those airbags caused every car on the road to crash at the same time.
  Obviously, the airbags that caused all the crashes are the primary cause. And the car manufacturers that allowed airbags to crash their cars bear some responsibility. But then we should also remind everyone that seatbelts are important and we should all be wearing them. The people who did wear their seatbelts were probably fine.
  Just because everyone is tightening IT budgets and buying licenses to panacea security services doesn't make it smart business.
  
  In this case, it's less like they stopped wearing seatbelts, and more like the airbags silently disabled the seatbelts from being more than a fun sash without telling anyone.
  To drop the analogy: the way the update deployed didn't inform the owners of the systems affected, and didn't pay attention to any of their configuration regarding update management.
  
  The crowdstrike driver has the boot_critical flag set, which prevents exactly what you describe from happening
  
  Yeah I know but booting in safe mode disables the flag so you can boot even if something is set to critical with it disabled. The critical flag is only set up for normal operations.
- The answer is simple: they have no idea what they are talking about. And that is true for almost every topic they are reporting about.
  
  But... the BSOD!
- It was a Crowdstrike-triggered issue that only affected Microsoft Windows machines. Crowdstrike on Linux didn't have issues and Windows without Crowdstrike didn't have issues. It's appropriate to refer to it as a Microsoft-Crowdstrike outage.
  
  Funny enough, crowdstrike on Linux had a very similar issue a few months back.
  
  It's similar. They did cause kernels to crash. But that's because they hit and uncovered a bug in the ebpf sandboxing in the kernel, which has since been fixed
  
  Are they actually shipping kernel modules? Why is this needed to protect from whatever it is they supposedly protect from?
  
  They need a file io shim. That's gotta be a module or it'll be too slow.
  
  I guess microsoft-crowdstrike is fair, since the OS doesn't have any kind of protection against a shitty antivirus destroying it.
  I keep seeing articles that just say "Microsoft outage", even on major outlets like CNN.
  
  Microsoft did have an Azure outage the day before that affected airlines. Media people don't know enough about it to differentiate the two issues.
  
  To be clear, an operating system in an enterprise environment should have mechanisms to access and modify core system functions. Guard-railing anything that could cause an outage like this would make Microsoft a monopoly provider in any service category that requires this kind of access to work (antivirus, auditing, etc). That is arguably worse than incompetent IT departments hiring incompetent vendors to install malware across their fleets resulting in mass-downtime.
  The key takeaway here isn't that Microsoft should change windows to prevent this, it's that Delta could have spent any number smaller than $500,000,000 on competent IT staffing and prevented this at a lower cost than letting it happen.
  
  Delta could have spent any number smaller than $500,000,000 on competent IT staffing and prevented this at a lower cost than letting it happen.
  I guarantee someone in their IT department raised the point of not just downloading updates. I can guarantee they advise to test them first because any borderline competent I.T professional knows this stuff. I can also guarantee they were ignored.
  
  Also, part of the issue is that the update rolled out in a way that bypassed deployments having auto updates disabled.
  You did not have the ability to disable this type of update or control how it rolled out.
  https://www.crowdstrike.com/blog/falcon-content-update-preliminary-post-incident-report/
  
  Their fix for the issue includes "slow rolling their updates", "monitoring the updates", "letting customers decide if they want to receive updates", and "telling customers about the updates".
  Delta could have done everything by the book regarding staggered updates and testing before deployment and it wouldn't have made any difference at all. (They're an airline so they probably didn't but it wouldn't have helped if they had).
  
  Delta could have done everything by the book
  Except pretty much every paragraph in ISO27002.
  That book?
  Highlights include:
  ops procedures and responsibilities
  change management (ohh. That's a good one)
  environmental segregation for safety (ie don't test in prod)
  controls against malware
  INSTALLATION OF SOFTWARE ON OPERATIONAL SYSTEMS
  restrictions on software installation (ie don't have random fuckwits updating stuff)
  ..etc. like, it's all in there. And I get it's super-fetch to do the cool stuff that looks great on a resume, but maybe, just fucking maybe, we should be operating like we don't want to use that resume every 3 months.
  External people controlling your software rollout by virtue of locking you into some cloud bullshit for security software, when everyone knows they don't give a shit about your apps security nor your SLA?
  Glad Skippy's got a good looking resume.
  
  Yes, that book. Because the software indicated to end users that they had disabled or otherwise asserted appropriate controls on the system updating itself and it's update process.
  That's sorta the point of why so many people are so shocked and angry about what went wrong, and why I said "could have done everything by the book".
  As far as the software communicated to anyone managing it, it should not have been doing updates, and cloudstrike didn't advertise that it updated certain definition files outside of the exposed settings, nor did they communicate that those changes were happening.
  Pretend you've got a nice little fleet of servers. Let's pretend they're running some vaguely responsible Linux distro, like a cent or Ubuntu.
  Pretend that nothing updates without your permission, so everything is properly by the book. You host local repositories that all your servers pull from so you can verify every package change.
  Now pretend that, unbeknownst to you, canonical or redhat had added a little thing to dnf or apt to let it install really important updates really fast, and it didn't pay any attention to any of your configuration files, not even the setting that says "do not under any circumstances install anything without my express direction".
  Now pretend they use this to push out a kernel update that patches your kernel into a bowl of luke warm oatmeal and reboots your entire fleet into the abyss.
  Is it fair to say that the admin of this fleet is a total fuckup for using a vendor that, up until this moment, was generally well regarded and presented no real reason to doubt while being commonly used? Even though they used software that connected to the Internet, and maybe even paid for it?
  People use tools that other people build. When the tool does something totally insane that they specifically configured it not to, it's weird to just keep blaming them for not doing everything in-house. Because what sort of asshole airline doesn't write their own antivirus?
  
  General practices aside, should they really not plan anybackups system though? Crowd strike did not cause 500 million in damages to delta, deltas disaster recovery response did.
  Where do we draw the line there though I'm not sure. If you set my house on fire but the fire department just stands outside and watches it burn for no reason, who should I be upset with?
  
  Well, in your example you should be mad at yourself for not having a backup house. 😛
  There's a lot of assumptions underpinning the statements around their backup systems. Namely, that they didn't have any.
  Most outage backups focus on datacenter availability, network availability, and server availability.
  If your service needs one server to function, having six servers spread across two data centers each with at least two ISPs is cautious, but prudent. Particularly if you're setup to do rolling updates, so only one server should ever be "different" at a time, leaving you with a redundant copy at each location no matter what.
  This goes wrong if someone magically breaks every redundant server at the same time. The underlying assumption around resiliency planning is that random failure is probabilistic in nature, and so by quantifying your failure points and their failure probability you can tune your likelihood of an outage to be arbitrarily low (but never zero).
  If your failure isn't random, like a vendor bypassing your update and deployment controls, then that model fails.
  A second point: an airline uses computers that aren't servers, and requires them for operations. The ticketing agents, the gate crew that manages where people sit and boarding, the ground crew that need to manage routine inspection reports, the baggage handlers that put bags on the right cart to get them to the right plane, and office workers who manage stuff like making sure fuel is paid for, that crews are ready for when their plane shows up and all that stuff that goes into being an airline that isn't actually flying planes.
  All these people need computers, and you don't typically issue someone a redundant laptop or desktop computer. You rely on hardware failures being random, and hire enough IT staff to manage repairs and replacement at that expected cadence, with enough staff and backup hardware to keep things running as things break.
  Finally, if what you know is "computers are turning off and not coming back online", your IT staff is swamped, systems are variously down or degraded, staff in a bunch of different places are reporting that they can't do their jobs, your system is in an uncertain and unstable position. This is not where you want a system with strict safety requirements to be, and so the only responsible action is to halt operations, even if things start to recover, until you know what's happening, why, and that it won't happen again.
  As more details have come out about the issues that Delta is having, it appears that it's less about system resiliency, although needing to manually fix a bunch of servers was a problem, and more that the scale of flight and crew availability changes overloaded that aforementioned scheduling system, making it difficult to get people and planes in the right place at the right time.
  While the application should be able to more gracefully handle extremely high loads, that's a much smaller failure of planning than not having a disaster recovery or redundancy plan.
  So it's more like I built a house with a sprinkler system, and then you blew it up with explosives. As the fire department and I piece it back together, my mailbox fills with mail and tips over into a creek, so I miss paying my taxes and need to pay a penalty.
  I shouldn't have had a crap mailbox, but it wouldn't have been a problem if you hadn't destroyed my house.
  
  First thank you for taking the time to type all of that out.
  I think I follow your theory well enough but (I know this is 2 weeks later so I won't look up any new information) I was under the impression delta was an outlier in their response compared to other airlines.
  And one point about redundancies. Why shouldnt they consider a single operating system as a single failure point? If all 6 servers in the multiple locations all run windows, and windows fails thats awful right? Can they not dual boot orhavee a second set of servers? I do this in my own home but maybe thats not something that scales well.
  I'm interested if your opinion has changed now that there has been a bit of time to have some more data come out on it.
  
  Competent IT staffing includes IT management
  
  Delta didn't download the update, tho. Crowdstrike pushed it themselves.
  
  yes, the incompetence was a management decision to allow an external vendor to bypass internal canary deployment processes.
  
  If you own the network you can prevent anything you want.
  
  The key takeaway here isn't that Microsoft should change windows to prevent this, it's that Delta could have spent any number smaller than $500,000,000 on competent IT staffing and prevented this at a lower cost than letting it happen.
  Well said.
  Sometimes we take out technical debt from the loanshark on the corner.
- Honestly, with how terrible Windows 11 has been degrading in the last 8 or 9 months, it's probably good to turn up the heat on MS even if it isn't completely deserved. They're pissing away their operating system goodwill so fast.
  There have been some discussions on other Lemmy threads, the tl;dr is basically:
  Microsoft has a driver certification process called WHQL.
  This would have caught the CrowdStrike glitch before it ever went production, as the process goes through an extreme set of tests and validations.
  AV companies get to circumvent this process, even though other driver vendors have to use it.
  The part of CrowdStrike that broke Windows, however, likely wouldn't have been part of the WHQL certification anyways.
  Some could argue software like this shouldn't be kernel drivers, maybe they should be treated like graphics drivers and shunted away from the kernel.
  These tech companies are all running too fast and loose with software and it really needs to stop, but they're all too blinded by the cocaine dreams of AI to care.
  
  They're pissing away their operating system goodwill so fast.
  They pissed it away {checks DoJ v. Microsoft} 25 years ago.
  
  Windows 7 and especially 10 started changing the tune. 10: Linux and Android apps running integrated to the OS, huge support for very old PC hardware, support for Android phone integration, stability improvements like moving video drivers out of the kernel, maintaining backwards compatibility with very old apps (1998 Unreal runs fine on it!) by containerizing some to maintain stability while still allowing old code to run. For a commercial OS, it was trending towards something worth paying for.
  
  I don’t know that Microsoft has OS goodwill. People use it because the apps are there, not because Windows has a good user experience.
  
  The driver is wqhl approved, but the update file was full of nulls and broke it.
  Microsoft developed an api that would allow anti malware software to avoid being in ring 0, but the EU deemed it to be anti competitive and prohibited then from releasing it.
  
  I think what I was hearing is that the CrowdStrike driver is WHQL approved, but the theory is that it's just a shell to execute code from the updates it downloads, thus effectively bypassing the WHQL approval process.
- Because Microsoft could have prevented it by introducing proper APIs in the kernel like Linux did when crowdstrike did the same on their Linux solution?
- Its sort of like calling the terrorist attack on 911 the day the towers fell.
  Although in my opinion, microsoft does have some blame here, but not for the individual outage, more for windows just being a shit system and for tricking people into relying on it.

134 comments