Lessons from the Recent Facebook iOS SDK Crash

10th Jul 2020


On July 10, popular apps like Spotify and Tinder suffered crashes due to a Facebook SDK issue. OverOps CTO Tal Weiss breaks down a few key learnings development teams can take from this mishap.

The recent Facebook iOS SDK crash highlights how a code or environment change by another company (e.g. FB) whose API you rely on can bring down your own app through no fault of your own. In 2020 no app is an :desert_island::tropical_fish:.

It’s worth noting this scenario could even be true when relying on internal company APIs (did someone say microservices?). For the API owner the lesson is clear: shift-left testing and early code error detection, combined with slow rollouts, are the order of the day. That’s what we advise our customers, and how we help them build the right toolchains to try and avoid these types of nightmare scenarios :scream:.
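As a rough illustration of what a slow rollout can look like on the API owner's side, here is a minimal sketch in Python. It assumes each user has a stable ID and that a change is exposed to a growing percentage of users; the function name `in_rollout` and the feature label are hypothetical and not taken from any particular toolchain.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Return True if this user falls inside the first `percent` of buckets.

    The bucket is derived from a hash of the feature name and user ID, so a
    given user always lands in the same bucket for a given feature. Raising
    `percent` only ever adds users; nobody who already has the change loses it.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # value in 0..99
    return bucket < percent


# Hypothetical usage: serve the new response format to 5% of users first,
# watch error rates, then raise the percentage step by step.
if in_rollout("user-1234", "new-response-format", 5):
    pass  # serve the new behavior
else:
    pass  # serve the old, known-good behavior
```

Because the bucketing is deterministic, an error spike seen at 5% can be tied back to the change, and the percentage can be set to 0 to withdraw it without shipping anything new.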

And now for the developer consuming the SDK: always (always!) design your app so that use of internal or 3rd party web services is done defensively – assume they will fail! :broken_heart: And if they do, make sure you can handle it gracefully by disabling the part of your app that relies on them, and be as explicit as you can with your users to let them know what’s happening.
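To make the "assume they will fail" advice concrete, here is a minimal, language-agnostic sketch written in Python (not iOS code). The `GuardedFeature` class, its failure threshold and the notice text are all hypothetical. Note that this pattern only covers failures that surface as catchable errors; a crash inside native SDK code may not be catchable this way.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class GuardedFeature:
    """Wraps one feature that depends on an external service (hypothetical helper)."""
    name: str
    max_failures: int = 3  # assumed threshold; tune per feature
    failures: int = 0
    enabled: bool = True
    notices: List[str] = field(default_factory=list)  # messages to show the user

    def call(self, action: Callable[[], str]) -> Optional[str]:
        """Run `action`; return None if the feature is disabled or the call fails."""
        if not self.enabled:
            return None
        try:
            result = action()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Too many consecutive failures: switch the feature off and
                # tell the user explicitly instead of failing silently.
                self.enabled = False
                self.notices.append(f"{self.name} is temporarily unavailable.")
            return None
        self.failures = 0  # a success resets the consecutive-failure count
        return result


def broken_login() -> str:
    # Stand-in for a third-party call that is currently failing.
    raise ConnectionError("third-party service failed")


social_login = GuardedFeature(name="Social login", max_failures=2)
social_login.call(broken_login)
social_login.call(broken_login)
# After two failures the feature is disabled and a user-facing notice is queued;
# the rest of the app keeps running.
```

The point of the design is that the failure stays contained in one feature: callers get `None` and a message to display, rather than an exception that takes the whole app down.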

Another approach to take if possible (which may not always be the case) is to consume the API from your backend and relay the results to the mobile client. This way your ability to detect issues, disable functionality or even mock benign data as a response back to your mobile client is significantly increased.
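The backend relay described above could be sketched roughly as follows. This is a simplified Python illustration under assumptions: the response shape (`status`, `items`), the `relay` function and the `upstream_enabled` switch are all hypothetical, and a real service would add timeouts, logging and alerting.

```python
from typing import Any, Callable, Dict


def benign_fallback() -> Dict[str, Any]:
    """A harmless, well-formed response the mobile client already knows how to render."""
    return {"status": "degraded", "items": []}


def relay(fetch_upstream: Callable[[], Any], upstream_enabled: bool = True) -> Dict[str, Any]:
    """Call the third-party API on the client's behalf and shield it from failures.

    `fetch_upstream` stands in for the real third-party call.
    `upstream_enabled` is a server-side switch that can be flipped
    without the user downloading an app update.
    """
    if not upstream_enabled:
        return benign_fallback()
    try:
        payload = fetch_upstream()
    except Exception:
        # Upstream is down or timing out: the client still gets a valid response.
        return benign_fallback()
    # Validate the shape before passing it on, so a malformed upstream
    # response never reaches code running on the device.
    if not isinstance(payload, dict) or not isinstance(payload.get("items"), list):
        return benign_fallback()
    return {"status": "ok", "items": payload["items"]}
```

Since the switch and the validation live on the server, a bad upstream change can be contained by a backend deploy or a config change rather than an app store release.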

All of this is 10X more important in a scenario such as this, where you’re consuming a mobile API and, in case of catastrophic failure (e.g. users can’t even start the app!), there’s no option of updating your code without end users downloading an update to their devices.

As to the specific root cause of the issue: most companies that face downtimes such as this, whether to their own service or, worse yet, inflicted on other companies that rely on them, are reluctant to provide a detailed postmortem. You can already see requests made by developers to FB on this issue and previous ones here and here. More often than not, the reason quoted is that such details could be used by hackers to glean insight into the inner workings of the service, enabling them to attack it more effectively later on.

While this is true, having done probably hundreds of postmortem analyses on production downtimes, I believe there’s (almost) always a safe way of detailing the cause of an issue, one that provides both transparency and a teachable moment, so that other developers can learn from the mistakes of others and prevent future downtimes of their own.

As an industry, I believe we would be better off if we made it a practice to publish postmortems that are as detailed as possible after such events, not as a means of “shaming” the folks involved, but as a way of sharing best practices and improving together.

I’m sure lessons will be learned by both FB (I’d hate to be the team that broke that service) on how to safely make backend changes that impact end-user SDKs, and by mobile developers who’ve learned the hard way that no API is infallible, and they should not treat it as such, even one made by omnipotent-ish FB.

As a co-founder and CTO, Tal is responsible for overseeing OverOps' product and engineering strategy. Previously, Tal was co-founder and CEO at VisualTao, acquired by Autodesk Inc. (ADSK). Following that, Tal was the Director for the AutoCAD global Cloud and Mobile product line. Plays Jazz drums and Skypes, sometimes simultaneously.
