Human error caused 2022 Rogers outage, system ‘deficiencies’ made it worse: report

The 2022 Rogers outage that left 12 million people without wireless and hard-wired services was caused by human error and made worse by management and system “deficiencies,” says an independent review conducted for Canada’s telecommunications regulator.

The review report also says steps taken by Rogers since the outage are “satisfactory to improve the Rogers network resiliency and reliability, as well as to address the root cause of the July 2022 outage.”

The 15-hour outage started early in the morning of July 8 and left individuals and businesses without access to their mobile, home phone, internet and 911 services.

The Canadian Radio-television and Telecommunications Commission (CRTC) commissioned Xona Partners in September 2023 to undertake the review and determine what caused the outage.

The engineering consultancy was also tasked with looking at whether the measures taken by Rogers since the outage are sufficient to prevent another incident.

Xona Partners’ findings were contained in the executive summary of the review report, released this month. The CRTC says the full report contains sensitive information and will be released in redacted form at a later, unspecified date.

The report summary says that in the weeks leading up to the outage, Rogers was undergoing a seven-phase process to upgrade its network. The outage occurred during the sixth phase of the upgrade.

“The July 2022 outage is attributed to an error in configuring the distribution routers within the Rogers IP network,” the report says.

Staff at Rogers caused the shutdown, the report says, by removing a control filter that directed information to its appropriate destination.

Without the filter in place, a flood of information was sent into Rogers’ core network, overloading and crashing the system within minutes of the control filter being removed.

Algorithm designated network upgrade as ‘low’ risk

The report says Rogers’ core network manages wireless and hard-wired data both internally, within the company, and externally, for outside customers and service providers.

“With both the wireless and wireline networks sharing a common IP core network, the scope of the outage was extreme in that it resulted in a catastrophic loss of all services,” the report says.

Having wireless and wireline services share the same network is a practice “common to many service providers,” the report says, adding that companies find it an efficient way to “balance cost with performance.”

Rogers has since announced that it will develop a new, separate network for its wireless systems while keeping hard-wired services on the old core network. The report says that work is ongoing.

People use electronics outside a coffee shop in Toronto amid a nationwide Rogers outage on Friday, July 8, 2022. (Cole Burston/The Canadian Press)

The review says that because the first five stages of the network update took place without incident, “the risk assessment algorithm downgraded the risk level for the sixth phase” of the upgrade.

Designating risks in phase six as “low” meant Rogers’ staff could avoid additional levels of scrutiny and approvals as the upgrade proceeded, even though doing so “contravenes industry norms,” the report says.

Rogers says it has since installed a new risk assessment algorithm to address the issue.
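
The report summary does not describe how either risk assessment algorithm works. As a purely illustrative sketch, the Python below contrasts a hypothetical rule that lowers the risk level after clean earlier phases with a hypothetical variant that refuses to go below “high” for any change touching the core network. The level names, the one-level-per-two-phases rule and the `touches_core` flag are all assumptions made for this example, not details from the report.

```python
# Illustrative only: neither function is Rogers' actual algorithm.
RISK_LEVELS = ["low", "medium", "high"]


def naive_risk(base: str, prior_successes: int) -> str:
    """Hypothetical rule: drop one risk level for every two clean earlier phases."""
    idx = RISK_LEVELS.index(base)
    return RISK_LEVELS[max(0, idx - prior_successes // 2)]


def floored_risk(base: str, prior_successes: int, touches_core: bool) -> str:
    """Same rule, but changes that touch the core network never fall below 'high'."""
    level = naive_risk(base, prior_successes)
    if touches_core:
        return "high"
    return level


# Phase six follows five clean phases.
print(naive_risk("high", prior_successes=5))
print(floored_risk("high", prior_successes=5, touches_core=True))
```

Under the first rule, a track record of successful phases is enough to strip away the extra scrutiny; under the second, the reach of the change sets a floor that past success cannot lower.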

The executive summary of Xona Partners’ review also says the “network failure could have been prevented” if Rogers had “overload protection mechanisms” limiting how much information is funnelled into the core network. 

The review recommends that all telecom networks in Canada implement overload protection mechanisms for their core networks. 
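
To make the idea concrete, here is a minimal Python sketch of one kind of overload protection: a cap on how many routes a core router will accept from a neighbour, loosely modelled on the route-count limits that routing software commonly offers. The report summary does not specify which mechanism Rogers lacked or later adopted, so the class names, the limit of 2,000 routes and the filter policy are all hypothetical.

```python
from dataclasses import dataclass


class OverloadError(Exception):
    """Raised when a neighbour offers more routes than the core router will accept."""


@dataclass
class CoreRouter:
    max_routes: int          # hypothetical operator-chosen limit
    accepted: int = 0
    session_up: bool = True

    def receive(self, route_count: int) -> None:
        """Accept a batch of routes, or shut the session if the cap would be exceeded."""
        if not self.session_up:
            raise OverloadError("session already shut down")
        if self.accepted + route_count > self.max_routes:
            # Shed the neighbour rather than exhaust the router's resources.
            self.session_up = False
            raise OverloadError(f"refused {route_count} routes (cap {self.max_routes})")
        self.accepted += route_count


def advertise(routes, route_filter=None):
    """Return the routes a distribution router passes on; no filter means everything."""
    if route_filter is None:
        return list(routes)
    return [r for r in routes if route_filter(r)]


all_routes = [f"route-{i}" for i in range(10_000)]


def needed(route: str) -> bool:
    """Hypothetical policy: only one route in ten belongs in the core."""
    return route.endswith("0")


core = CoreRouter(max_routes=2_000)
core.receive(len(advertise(all_routes, needed)))   # 1,000 routes pass the filter

unprotected_flood = advertise(all_routes)          # filter removed: all 10,000 routes
try:
    CoreRouter(max_routes=2_000).receive(len(unprotected_flood))
except OverloadError as err:
    print("core protected:", err)
```

With the cap in place, removing the filter costs the connection to one neighbour instead of the whole core, which is the trade-off such limits are designed to make.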

Challenges restoring the network

Rogers’ efforts to bring its systems back up were hampered above all by the company’s inability to communicate properly once the network went down.

The report says that when the core network went down, remote employees were unable to access Rogers’ systems or use the internet and could not get online by using other service providers.

“Rogers had to dispatch staff to remote sites to physically access the affected routers, which delayed network recovery efforts,” the report says.

All incident response and crisis team members at Rogers have since been provided with backup, third-party access to the internet to “maintain communication capabilities during outages.”

The review also says that Rogers staff could not access critical error logs detailing the root cause of the outage until 14 hours after the outage began, which “adversely impacted outage recovery efforts.”

John Lawford, executive director of the Public Interest Advocacy Centre in Ottawa, has been pushing Rogers and the CRTC for more transparency on the outage. 

He criticized the CRTC for taking two years to deliver a report on the outage, describing it as a “whitewash in the sense of both the CRTC and Rogers being very much let off the hook.”

“The report makes a claim that Rogers has rectified the issue and there is insufficient evidence for me there to see that,” Lawford said. “This is just one particular expert’s viewpoint.”

Rogers declined CBC News’ requests for an interview. It issued a statement insisting it is “the most reliable network” and saying it will continue to invest so “Canadians enjoy the best networks in the world.”

“We completed a full review of our networks, strengthened our network resiliency, and implemented all the recommendations of this report,” Rogers said. 
