Campus Canvas outage disrupts classes

Students on campus saw a halt to courses, exams, and assignments as Amazon Web Services experienced a total shutdown. // Photo courtesy of Arthur Mansavage, Beacon Staff

Amazon Web Services’ (AWS) US-East-1 data center started reporting failures in domain name system (DNS) resolution around 3:00 a.m. EST on Oct. 20. What followed was several hours of broken platforms, lagging online operations and disrupted services worldwide.

The initial error in DNS, which is used to connect computers to servers, triggered a chain of failures across dependencies in multiple systems and affected several services that rely on AWS, including Slack, Roblox, Snapchat, Duolingo and several banking and airline sites.

Many businesses around campus, including Atwoods Pizza and Momonoki, also had their online ordering services go down. 

“I was teaching a class and had no idea about the outage,” said Professor Aibek Musaev, a lecturer in the College of Computing. “I was explaining the difference between the two kinds of exceptions and actually used Amazon as an example — like, imagine what would happen if Amazon went down and a bug made it all the way to production. After class, a student came up to me and said Amazon’s actually down, and I said, ‘No, it’s just an example.’ But they told me it was real.” 

The outage also directly affected Tech students: along with several enterprise-level platforms, it shut down Canvas, a learning management system used by schools to consolidate assignments, grades and course materials.

“We have automated monitoring set up on all our critical platforms and tools, so we knew Canvas was down before the announcements went out,” said Warren Goetzel, Director of Academic Technology in the Office of Information Technology (OIT). “So we knew pretty much right away and started a major incident process. The maintenance page was put up on the website, and they decided to put it up because it was unstable and they didn’t want to have students’ experiences interrupted or data loss, so the maintenance page stayed up till about 7:30 p.m. on Monday.” 

Goetzel’s team is a part of academic and research technologies (ART), which includes three teams: digital learning, PACE and TAG. The digital learning team acts as service manager and service owner for Canvas, as well as all the platforms and tools in the teaching and learning ecosystem at Tech. 

The first step in the major incident process is to post on the school’s status page; students and staff then received emails letting them know of the situation. Throughout the day, the team monitored service statuses and communicated internally with staff. Because it was an AWS issue, there was nothing more that they could do from campus.

The disruption felt on campus and around the world has led many to discuss whether there is too much dependency on cloud-based platforms. 

“The outage basically put everything behind,” said Professor Jessica. “I couldn’t post slides, make announcements or do grading because all of that goes through Canvas. It was only one day, but it disrupted the whole week’s schedule. I think this shows that there definitely is too much dependency on one very fallible system. It works most of the time, which is great, but this outage shows how vulnerable these systems are and how dependent we’ve become.”

Goetzel explained that although these disruptions can happen, ones at this scale are rare, and most platforms are stable the majority of the time.

“Global outages like this don’t happen often,” Goetzel said. “Canvas proudly markets a 99.9% uptime. So the disruption of academic continuity is very limited. But, just like any other emergency, contingencies need to be made. I mean, we all understand that disruptions to teaching and learning are very problematic, and in our increasing dependency on technology, we don’t always have a failover plan for a no-tech or low-tech option. So I think it’s important for folks to be mindful that it can happen, and to be prepared for the alternative.” 
