Monday, November 17, 2014

In Response to Blizzard's Release Day Issues

Blizzard has been taking a beating over this recent launch. Clients worldwide are getting hit with long queue times, slow responses from the game servers and even random disconnects. As an Information Technology professional for 15+ years, I can totally sympathize with them.

I work for a large public sector agency on a team responsible for supporting email services for close to 100,000 employees. I am responsible for the design, deployment and day-to-day operation of the computers hosting the email services. For comparison purposes, I would suspect that I am responsible for approximately 3 medium-pop realms' worth of users on WoW. This is my attempt at explaining a technical environment and its issues in terms that are as user-friendly as possible.

Design: If I were to design a 'realm', I would want to mirror what we do with our email environment. It meets the needs of a high-demand environment, where the users connect with a fat client and require pretty much 24x7 access.

First off, there would be resiliency. This means multiple computers that are supporting active clients and exact-copy (passive) servers that are simply waiting for something to go wrong with the first (active) set. This allows my team to do minor repairs on the 'passive' servers while clients are actively working. This "cluster" of active & passive computers works together, sharing information for that inevitable moment when everything goes haywire. As the population grows and shrinks, I can add/remove members of a cluster or spawn new clusters (i.e. add new realms).
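
For the technically curious, here is a rough sketch of that active/passive idea in Python. It is only an illustration of the concept, not how Blizzard (or my agency) actually does it; the Server and Cluster classes and the node names are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    """Hypothetical active/passive pair; all names are illustrative."""
    active: Server
    passive: Server
    log: list = field(default_factory=list)

    def route(self, client: str) -> str:
        # If the active node is down, promote the passive copy first.
        if not self.active.healthy:
            self.failover()
        return f"{client} -> {self.active.name}"

    def failover(self) -> None:
        if not self.passive.healthy:
            raise RuntimeError("no healthy server left in the cluster")
        self.log.append(f"failover: {self.active.name} -> {self.passive.name}")
        # Swap roles: the old active node becomes the one we can repair.
        self.active, self.passive = self.passive, self.active


cluster = Cluster(Server("node-a"), Server("node-b"))
first = cluster.route("player1")
cluster.active.healthy = False  # simulate the active node failing
second = cluster.route("player1")
```

The player never has to know which box they landed on; the cluster quietly swaps roles and the broken node becomes the 'passive' one we can work on.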

Second would be security. Imagine your current residence (home, apartment, condo, etc.). Would you rather have all of your stuff simply out in the open, lying out on the lawn? Or would you want some walls, a door, locks and maybe even a high-end security system protecting your valuables? Our accounts, our characters and the entire virtual world are our valuables. Protecting that realm would involve perimeter security surrounding the entire server cluster, possibly around all clusters (i.e. a region such as US, EU, etc.). There would be a single point of entrance into this area for all connections to the cluster. I would likely place a high-end security device there that could monitor the traffic going in and turn away bad traffic. It's easy to manage, and I won't need millions of Internet connections (or IP addresses).

The alternative is placing each and every server directly on the Internet (i.e. out on the lawn). Doing this would open up each and every server to all sorts of malware, hacks and malicious intent. We would see a LOT more issues if this configuration were used. Anyone driving by our property could simply come by our 'place' and pick up something of value.

Day-to-Day: Blizzard grew to over 11 million players, then shrank back down to 7 million. This left Blizzard in a lurch with their investors: they had the cost of maintaining 100+ realms, but not the subscriptions to support clusters of 6-10 high-end servers and their back-end infrastructure (backups, power, data center floor space, etc.). These costs don't go down. In order to reduce costs, Blizzard had to decrease their footprint on that data center floor. This is why Malfurion and Trollbane joined forces. Malfurion had shrunk, so we were migrated to Trollbane's clusters. We kept our realm info, but we've all been consolidated onto Trollbane's hardware. The beauty is lower cost of ownership on the equipment, the same level of user satisfaction (in fact, no- to low-pop servers come back to life!) and Blizzard doesn't have to raise subscription costs! The drawback is that if everyone logged in at the same time, there might not be enough processing power to support the entire user base at once.

Diagnosis: There are two things I think went on. Based on the Twitter feeds of several individuals, I don't think I am far off.


First off, the DDoS attack was focused on the front door of the datacenter. That front door was absolutely bombarded with millions of connections from both legitimate and illegitimate sources. Imagine a huge line of door-to-door salesmen ringing your doorbell at the same time your guests are arriving for a GIGANTIC party. The bouncer at the door needs to check with each person and see why they're there. We still got into the party, but the line was out the door, around the block a few times and then into the surrounding neighborhood. Even if only 1 in 1,000 people were these annoying salesmen, the bouncer is going to have a lot of extra work to do when there are several million people coming to your party. You have people scan the crowd for anyone carrying a briefcase or wearing a suit. You ask them what the party is for and/or ask questions about the host. It all takes time at that gate. Unfortunately, most of the time you simply have to ride it out.

Now, realize that these "salesmen" are not the only people hammering on the front door. The attacks could also be you and me. (wait... keep reading) At my RL job, we've had legitimate users who were too chatty with our environment take down servers. For example, Apple regularly releases a new iPhone OS and many people immediately update their devices. These updates are not always well tuned for our email servers and may require us to patch them, or Apple may have a bug that needs to be fixed. Every once in a while, a phone finds a corrupted calendar entry and asks for an update to that entry 100,000 times an hour. This chattiness prevents anyone else from pulling their calendar updates. To stop the traffic, we end up blocking this user's access to their email until they fix their phone. Now, with each major expansion, Blizzard often changes access interfaces on the game. These changes break add-ons. Add-ons that people come to love and rely on. Add-ons that they'll run even if an outdated warning comes up. Blizzard cannot (easily) block 1 million customers the way we block one phone.
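
To make the "chatty client" problem concrete, here is a small Python sketch of the kind of per-client limit I'm describing: count each client's requests in a rolling window and block the ones that go overboard until somebody fixes them. The ChattyClientFilter class, the client names and the thresholds are all invented for illustration; real mail servers and game front ends have their own throttling features.

```python
from collections import defaultdict, deque


class ChattyClientFilter:
    """Sliding-window request counter per client (illustrative only)."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.history = defaultdict(deque)  # client id -> request timestamps
        self.blocked = set()

    def allow(self, client_id: str, now: float) -> bool:
        if client_id in self.blocked:
            return False
        stamps = self.history[client_id]
        # Drop timestamps that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.max_requests:
            # Too chatty: block until someone fixes the device.
            self.blocked.add(client_id)
            return False
        stamps.append(now)
        return True

    def unblock(self, client_id: str) -> None:
        # Called once the owner has fixed the phone (or the add-on).
        self.blocked.discard(client_id)
        self.history.pop(client_id, None)


limiter = ChattyClientFilter(max_requests=3, window_seconds=60.0)
phone = [limiter.allow("broken-phone", now=float(t)) for t in range(5)]
normal = limiter.allow("normal-user", now=4.0)
```

The broken phone gets its first three requests through and is then cut off, while the normal user is untouched. That works for one phone at a time; it gets a lot harder when the "broken phone" is an outdated add-on running on a million clients.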

Secondly, over-commitment. I think Blizzard was a bit surprised by the number of concurrent users. Now that the clusters had been shrunk to avoid additional costs, when our sudden flash mob hit the door, we overwhelmed the capacity. Blizzard had to put population caps on the highest-pop realms until they could add back capacity. ((This likely cost them a bunch of $$ to grab servers that quickly: "here, take my credit card and go buy whatever you need!!")) Like the gym after New Year's (aka the "New Year's Resolute-rs"), the spike in traffic was there on release day, then will slowly taper off. Me taking vacation time on Thursday and Friday threw off the predicted connection patterns that they designed for. Sorry. My fault. I don't think I'll be taking the next expac release day off.
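
The population cap plus login queue can be pictured with a few lines of Python. This is my own toy model with a made-up RealmGate class and tiny numbers, not Blizzard's actual queue logic.

```python
from collections import deque


class RealmGate:
    """Population cap with a FIFO login queue (illustrative only)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.online = set()
        self.queue = deque()

    def login(self, player: str) -> int:
        """Return 0 if admitted, otherwise the player's queue position."""
        if len(self.online) < self.capacity and not self.queue:
            self.online.add(player)
            return 0
        self.queue.append(player)
        return len(self.queue)

    def _admit_waiting(self) -> None:
        # Move queued players into any free slots, first come first served.
        while self.queue and len(self.online) < self.capacity:
            self.online.add(self.queue.popleft())

    def logout(self, player: str) -> None:
        self.online.discard(player)
        self._admit_waiting()

    def add_capacity(self, extra: int) -> None:
        # "Here, take my credit card": more servers, more slots.
        self.capacity += extra
        self._admit_waiting()


gate = RealmGate(capacity=2)
positions = [gate.login(p) for p in ["a", "b", "c", "d"]]
gate.add_capacity(1)
```

With room for two, players "c" and "d" wait in line; adding one slot of capacity lets "c" in while "d" keeps waiting until someone logs out or more hardware shows up.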

Going forward: Blizzard has gotten over the hump. The erratic release-week traffic has died down; I am back at the office now. We are already starting to see max-level 100s on the realms; I saw 2 last night. If Blizzard plays their cards right, we'll continue to see old players come back to the game for about a year. After that, I would anticipate a drop-off in subscriptions if another expansion doesn't drop. They just need to make sure realm performance holds up.

Thanks for the time, Blizzard,

Elkagorasa
