The IRS spent Tax Day trying to resolve IT issues rather than processing last-minute returns.
When the news broke that the IRS’s Modernized e-File (MeF) system was down, along with the Direct Pay and Payment Plan pages on the IRS site, three possible scenarios that can take businesses down came to mind: a hack, overloaded systems, or pure coincidence.
How likely is each?
While an attacker could breach the systems or the perimeter and turn off the services that allow connections to the systems accepting tax returns, no one is claiming responsibility and, so far, there’s no evidence that this is a malicious denial-of-service attack.
It could just be an outage unrelated to the workload, one that coincidentally occurred on arguably the worst possible day. In that case, the question is how quickly the IRS can bring its systems back up. Doing so within acceptable recovery time objectives (RTOs) and recovery point objectives (RPOs) should be a matter of recovering its production environment or switching over to backup systems.
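To make those two objectives concrete: the RTO caps how long a service may stay down, and the RPO caps how much recent data may be lost. The sketch below checks a single outage against both. The function name, the field names, and every number in the example are hypothetical and chosen only for illustration; nothing here reflects the IRS's actual targets.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable window of lost data


def meets_objectives(objectives, outage_start, service_restored, last_good_backup):
    """Compare one outage against the stated RTO and RPO."""
    downtime = service_restored - outage_start
    # Anything written after the last good backup is at risk of being lost.
    data_loss_window = outage_start - last_good_backup
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }


# Hypothetical targets and timestamps, for illustration only.
objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(minutes=15))
outage_start = datetime(2024, 1, 1, 9, 0)
report = meets_objectives(
    objectives,
    outage_start=outage_start,
    service_restored=outage_start + timedelta(hours=36),  # a day and a half
    last_good_backup=outage_start - timedelta(minutes=10),
)
```

With these made-up numbers, a day-and-a-half recovery misses a four-hour RTO by a wide margin even though the backup is recent enough to satisfy the RPO.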
But the most likely reason for the outage? The IRS’s systems couldn’t handle the workload from all the last-minute tax filers submitting their returns online.
Like outages that occur during Black Friday or online concert ticket sales, the MeF system might not have been prepared to handle the burst of activity. It’s akin to the days of yore when you’d try to dial in to a radio contest and the operator would come on saying, “All lines are busy, please hang up and try again.”
Simply put, this outage could be a denial of service, and not necessarily one caused by malicious activity from a horde of bots. It could just be a flood of legitimate submissions and traffic.
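The radio-station busy signal has a software equivalent: rather than letting a flood of legitimate traffic knock everything over, a service can cap how many requests it works on at once and politely turn the rest away. Here is a minimal sketch of that idea, assuming a single process and a fixed cap. The class and function names are hypothetical, and this is not a description of how MeF works.

```python
import threading


class BusySignalGate:
    """Admit at most max_in_flight requests at a time; refuse the rest."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_enter(self):
        # Non-blocking: returns False immediately when every slot is taken.
        return self._slots.acquire(blocking=False)

    def leave(self):
        self._slots.release()


def handle_submission(gate, process):
    """Run process() if there is capacity, otherwise answer with a busy signal."""
    if not gate.try_enter():
        # 503 is the standard HTTP status for "Service Unavailable".
        return 503, "All lines are busy, please try again."
    try:
        return 200, process()
    finally:
        gate.leave()
```

Turning callers away early keeps the requests already in progress healthy, and a clear "try again" response is easier on users than a page that never loads.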
Companies can prepare for spikes in traffic with distributed cloud-based systems that can increase bandwidth and processing power on demand. If one system goes down, the others will still be available to handle requests.
Cloud computing provides an easy way to bring up many identical servers, all providing the same functionality, in a short time. They can also be spread geographically to distribute the computing load and reduce network saturation.
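How does a system decide how many identical servers to bring up? One common approach is a proportional rule: scale the server count by the ratio of observed load to the load each server is meant to carry. The sketch below is an illustrative version of that rule. The function name, parameters, and limits are assumptions, not any particular cloud provider's API.

```python
import math


def desired_servers(current_servers, observed_load_per_server,
                    target_load_per_server, min_servers=1, max_servers=100):
    """Proportional scaling: grow or shrink the pool to bring per-server load back to target."""
    if target_load_per_server <= 0:
        raise ValueError("target_load_per_server must be positive")
    raw = math.ceil(current_servers * observed_load_per_server / target_load_per_server)
    # Clamp so a traffic spike cannot request an unbounded number of servers.
    return max(min_servers, min(max_servers, raw))


# Hypothetical spike: 4 servers each seeing 900 requests/sec against a target of 300.
spike = desired_servers(4, 900, 300)
# Hypothetical lull: the same 4 servers each seeing only 100 requests/sec.
lull = desired_servers(4, 100, 300)
```

The upper bound matters as much as the formula: it keeps a runaway spike, or a bot flood, from turning into a runaway bill.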
For now, it seems the IRS needs to make its systems more resilient, and one place to start is updating its disaster recovery plan. Reports said the IRS was working to reboot its system, and if that didn’t work, it could take up to a day and a half to get a backup system running.
While we wait to find out more details on the issue, the lesson for the IRS (and all organizations) today is to make sure your production systems are designed to handle unusual influxes of activity, and that you can add capacity and resources to keep up with demand.