Epic relies on Nomad to keep gamers’ fantasy islands afloat
Kubernetes would have meant a “monumental” amount of work
Gamers might appreciate the creativity and dedication that go into producing their favourite experiences. But they could be forgiven for ignoring the underlying infrastructure that enables their development and deployment.
Epic Games technical director, Paul Sharpe, gave an insight into the massive scale and management capabilities required to support its ecosystem in a session at HashiConf.
Epic is the games powerhouse behind titles such as the Gears of War series, and more recently, Fortnite. Central to its success, and workflow, is its Unreal Engine 3D creation tool. It is used beyond the games world in applications such as automotive design, architecture, and medicine.
Fortnite’s Creative Mode allows players to build their own games and experiences on their own “private Fortnite islands”.
In turn, explained Sharpe, “We began to expand on the ideas and technologies involved with the goal of building the foundation for an open ecosystem where players can discover and participate in a vast variety of meaningful experiences together with no gatekeepers whatsoever.”
The result was Unreal Editor for Fortnite (UEFN), which launched in March 2023. It gives creators access to tooling and assets, as well as revenue “based on player engagement with their experiences.” Sharpe said that over 700,000 “Islands” have been created since the launch.
Unreal Editor for Fortnite: "Cooking" up a storm
While UEFN is a PC application from the creator’s point of view, turning creators’ work into playable islands requires a vast amount of backend resources, not least for “cooking”.
Sharpe described cooking as “The process in which assets are compiled or converted into platform specific formats required to be loaded by the engine on said platforms, Windows, PlayStation, Xbox and Switch.”
This involves operations like compiling shaders, converting audio formats, and translating static meshes. Sharpe said these are extremely resource intensive, and are normally done on “a developer workstation with all the cores and all the RAM, or on a giant render farm.”
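To make the scale of a single cook concrete, here is a minimal sketch of how the work fans out across platforms. It is an illustration only, not Epic's pipeline: the platform list comes from Sharpe's description, while the asset names, the asset kinds, and the operation labels are hypothetical.

```python
from itertools import product

# Platforms named in the talk.
PLATFORMS = ["Windows", "PlayStation", "Xbox", "Switch"]

# Hypothetical mapping from asset kind to the kind of work it needs.
OPERATIONS = {
    "shader": "compile_shader",
    "audio": "convert_audio",
    "static_mesh": "translate_mesh",
}


def plan_cook(assets):
    """Return one (asset, platform, operation) step per asset and platform."""
    return [
        (name, platform, OPERATIONS[kind])
        for (name, kind), platform in product(assets, PLATFORMS)
    ]


# Hypothetical island with three assets.
island_assets = [
    ("water.shader", "shader"),
    ("theme.wav", "audio"),
    ("rock.mesh", "static_mesh"),
]
steps = plan_cook(island_assets)
print(len(steps))  # 3 assets x 4 platforms = 12 steps
```

Even this toy island produces a dozen conversion steps, which hints at why real cooks are usually left to high-end workstations or render farms.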
Supporting the creator economy means creating “a platform for doing these kinds of operations, potentially with tens to hundreds of thousands of simultaneous workflows in play.”
This required a scheduling and orchestration tool. Epic was already using HashiCorp’s Nomad in its Unreal Cloud Services and saw it as a fit for the requirements of supporting the UEFN cooking workflow. (Nomad is a workload scheduler and cluster manager to deploy and manage containers and non-containerized applications.)
According to Sharpe, these included “Reliable container orchestration available through an API, multistage workflows for executing batch jobs, control over container runtimes to configure compute isolation, solid filesystem isolation to secure user generated content whenever anyone can bring their own assets and whatnot. And the ability to operate on Windows.”
Key features in Nomad included the ability to control the scheduling algorithm for containers. “Using the spread algorithm allows us to fan these cooks out horizontally and minimise any CPU contention when hopping onto a box that wasn't fully loaded.”
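The difference between spreading and bin-packing can be sketched with a toy placement model. This is not Nomad's scoring code, which weighs more factors; the host names, core counts, and the `pick_host` and `place` helpers are assumptions made for illustration.

```python
def pick_host(hosts, cpu_needed, algorithm="spread"):
    """Pick a host for a job needing `cpu_needed` cores.

    `hosts` maps host name -> (used_cores, total_cores).
    Returns None when no host has enough free cores.
    """
    utilisation = {
        name: used / total
        for name, (used, total) in hosts.items()
        if total - used >= cpu_needed
    }
    if not utilisation:
        return None
    if algorithm == "spread":
        # Favour the emptiest host to reduce CPU contention.
        return min(utilisation, key=utilisation.get)
    # "binpack": favour the fullest host that still fits the job.
    return max(utilisation, key=utilisation.get)


def place(hosts, jobs, algorithm):
    """Place jobs one by one, returning the chosen host for each."""
    hosts = dict(hosts)  # work on a copy so the caller's view is unchanged
    placements = []
    for cpu in jobs:
        name = pick_host(hosts, cpu, algorithm)
        placements.append(name)
        if name is not None:
            used, total = hosts[name]
            hosts[name] = (used + cpu, total)
    return placements


# Hypothetical three-host cluster with 16 cores each.
cluster = {"host-a": (0, 16), "host-b": (0, 16), "host-c": (0, 16)}
spread = place(cluster, [4, 4, 4], "spread")
binpack = place(cluster, [4, 4, 4], "binpack")
print(spread)   # ['host-a', 'host-b', 'host-c']
print(binpack)  # ['host-a', 'host-a', 'host-a']
```

Spreading leaves each cook on its own lightly loaded box, at the cost of keeping more hosts warm, which fits a setup where new capacity cannot be added quickly.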
This was important, said Sharpe, because due to the size of the containers involved, and how long it takes Windows to pull a container, “We couldn't really rely on reactive auto scale.”
Another key feature, he said, was task lifecycles. “The concept of having a main task with hooks to launch other tasks around it was precisely what we needed for a cook workflow,” he explained. “If we were using Kubernetes, we would have had to extend it to provide this functionality. With Nomad, again, we just had it out of the box.”
Attempting to use Kubernetes for supporting the platform, “would have been a monumental amount of extra work, particularly due to the multistage workflow, filesystem and Windows requirements.”
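The idea of a main task with hooks around it can be sketched as a tiny runner. Nomad's task lifecycle includes prestart and poststop hooks; everything else here is an assumption for illustration, including the `run_group` helper, the task names, and the choice to run cleanup even when the main task fails.

```python
def run_group(tasks):
    """Run tasks in lifecycle order: prestart, then main, then poststop.

    Each task is a (hook, name, callable) tuple. In this sketch,
    poststop tasks run even if an earlier task raises.
    """
    by_hook = {"prestart": [], "main": [], "poststop": []}
    for hook, name, fn in tasks:
        by_hook[hook].append((name, fn))

    log = []
    try:
        for hook in ("prestart", "main"):
            for name, fn in by_hook[hook]:
                fn()
                log.append(f"{hook}:{name}")
    except Exception as exc:
        log.append(f"failed:{exc}")
    finally:
        for name, fn in by_hook["poststop"]:
            fn()
            log.append(f"poststop:{name}")
    return log


# Hypothetical cook workflow: fetch inputs, cook, then upload results.
workflow = [
    ("poststop", "upload-results", lambda: None),
    ("main", "cook", lambda: None),
    ("prestart", "fetch-assets", lambda: None),
]
order = run_group(workflow)
print(order)
# ['prestart:fetch-assets', 'main:cook', 'poststop:upload-results']
```

The declaration order of the tasks does not matter; the hook each task is attached to decides when it runs.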
Not that Nomad presented no challenges whatsoever. One was managing the firehose of data and metrics Nomad could throw off. “They’re really helpful,” he said, but “familiarise yourself with it, because there's a million.”
But, he explained, “the majority of issues that we've had in the past six months have come from running Windows, and particularly Docker interactions.” Fixing these required non-intuitive network tweaks.
The initial aim was to scale up for 20,000 concurrent cooks, which meant having a cluster that had over 2,000 hosts available for jobs.
However, when the team began load tests, things choked, with the scheduling workers in the control plane simply overwhelmed. This was solved by upgrading the control plane to “much larger systems with all the cores and all the NVMe drives” and splitting the cluster up into smaller clusters.
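The initial target implies a rough packing density per host, which a few lines of arithmetic make explicit. The two figures come from the talk; since the host count was given as "over 2,000", the result is an upper bound, and the `hosts_needed` helper is a hypothetical convenience.

```python
import math

TARGET_CONCURRENT_COOKS = 20_000  # initial goal quoted in the talk
HOSTS_AVAILABLE = 2_000           # "over 2,000", so treat as a lower bound

# At most this many cooks share a host when the cluster is fully loaded.
cooks_per_host = math.ceil(TARGET_CONCURRENT_COOKS / HOSTS_AVAILABLE)
print(cooks_per_host)  # 10


def hosts_needed(concurrent_cooks, per_host):
    """Hosts required to run `concurrent_cooks` at `per_host` cooks each."""
    return math.ceil(concurrent_cooks / per_host)


print(hosts_needed(20_000, cooks_per_host))  # 2000
```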
Do the gamers care about all this work?
Probably not, though they arguably should. You can’t have a battle royale if the entire back end is going down in flames.