Dead Servers and Sweaty Men

It seemed like a regular Friday at API Fortress when suddenly our failure alert Slack channel started blowing up. The alerts were API failures from MajorCRM Inc (keeping this private for obvious reasons). When you call APIs constantly, as we do, random HTTP failures are common: connection timeouts, socket timeouts, bad gateway, service unavailable, and so on. As long as the events are isolated there really isn’t much to worry about, although some could be prevented.

On this Friday the errors were coming rapidly, and were all from the same company. When something of this magnitude happens on an enterprise account, our QA team investigates to extract information that could help the customer diagnose what is happening. There are multiple potential causes of this sort of error: an ESB failure, an API Manager failure, a network problem, or a problem connecting to the database. Most of these can be easily recognized by just looking at our test reports. This wasn’t one of those times.

The first piece of evidence was that all the failures were coming from one specific datacenter: San Jose, California. Geographic correlation is a good starting point. The second thing we noticed was that the failures were intermittent. Generally this means the failing component might be going up and down, or there’s one server failing inside a cluster. But these two pieces of information were nothing compared to what a closer look at the details showed: the errors were all different! Here’s a partial list:

Timeout during socket read
data isn’t an object ID
Signature algorithm mismatch
lengthTag=127, too big
Sequence tag error
compression type not supported, 4
insufficient data
unable to find valid certification path to requested target
invalid distance too far back
invalid bit length repeat
Not in GZIP format
invalid code — missing end-of-block
not an Octet String

It did not take long to realize the errors were related to SSL certificates AND GZIP compression. We have not asked them what happened in detail, but one thing we know for sure is that there was a sweating and cursing IT guy who deserved our sympathy. They had a bad Friday.
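For readers who want to try this kind of triage themselves, here is a minimal sketch in Python of how a pile of seemingly unrelated messages can be bucketed into families. It is not how our platform does it: the category names, the keyword map, and the `classify` and `summarize` helpers are all assumptions made up for illustration, and the keywords are just a guess based on the list above.

```python
from collections import Counter

# Hypothetical keyword map: the categories and keywords are illustrative
# assumptions drawn from the error list in this post, not a complete taxonomy.
CATEGORIES = {
    "tls/certificate": (
        "object id", "signature algorithm", "lengthtag", "sequence tag",
        "certification path", "octet string",
    ),
    "gzip/deflate": (
        "distance too far back", "bit length repeat", "gzip", "end-of-block",
    ),
    "network": ("timeout",),
}


def classify(message):
    """Return the first category whose keyword appears in the message."""
    lowered = message.lower()
    for category, keywords in CATEGORIES.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    # Ambiguous messages (for example "insufficient data") are left unmatched
    # on purpose rather than forced into a family.
    return "unknown"


def summarize(messages):
    """Count how many messages fall into each category."""
    return Counter(classify(message) for message in messages)


if __name__ == "__main__":
    sample = [
        "Timeout during socket read",
        "Signature algorithm mismatch",
        "lengthTag=127, too big",
        "Not in GZIP format",
        "invalid bit length repeat",
        "insufficient data",
    ]
    for category, count in summarize(sample).most_common():
        print(category, count)
```

Once the messages are grouped like this, a wall of "all different" errors collapses into two or three families, which is usually enough to tell the customer where to start looking.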
