I’ve been playing recently an online game that has recently launched, that uses the following idea.
When a user starts a match, it spawns a process in the server that acts as the opponent, generating the actions against the user.
The game had a rough launch, with a lot of problems due it being played by a lot of people. And, IMHO, a lot of the problems can be traced to that idea.
I see it’s a seductive one. If a user generates an interaction with the service that takes time (for example, a match for this game), spawn a process/thread in the server that generates the responses in “real time“. The user then will be notified through polling or pushing the information, and can react to it. The process will receive the new information from the user and adjust the responses.
I know is seductive because I had it once, and I was very lucky to have someone around with more experience that show me how it will break under pressure. It’s not a sane architecture to scale.
Some bad ideas:
- No limit on processes, meaning the servers can be overflown by context switching. Once you have several thousand processes running on a server, you are in a bad place.
- The very definition of state on the server. You need to keep track of processes started on different servers (so no two servers perform the same job). High Availability is impossible, as losing one server will mean destroying the state on all those processes. For scalability, always look at stateless servers: read all the data, store the resulting data.
- Start up times. Each time a process starts, there’s some time to boot. This can be a problem if processes are always being started and stopped, adding overhead to the system. Even starting a thread is not free (and will require probably starting internal work like connect to the DB, read from cache, etc)
- Connections explosion. If each process needs to connect to other parts of the infrastructure (DB, logging, cache, etc) you can have a problem in number of connections.
- Process monitoring. What if a process gets stuck? A request can be cancelled easily by a web server (if a request takes more than X, kill it), but an individual process or thread can be more complicated and require specific tooling.
Alternative: Pool of workers
Generate a defined number of processes that can perform the individual actions that generates a match. Each process will get an action from a queue, execute it, and store the resulting state. Any process can produce an action for any user.
For example, if a match is a set of 20 actions, each one happening every minute, the start match request will introduce 20 actions in a queue, to be extracted at the proper time, introducing the proper delay on each action. Note that the queue needs to have a way of deliver delayed messages, not every messaging queue can (in particular, RabbitMQ doesn’t have a good support). Beanstalkd or Amazon SQS supports it.
Or, alternatively, a single action, that will end inserting the next step in the queue with the adequate delay. The action can be as simple as checking if it should change something and, if not, end.
The processes will be extracting the next action from the queue, and executing them. Note that here you minimise the time the worker is waiting for a new task to do. Each worker is active as much as possible, while any user has a pending task, ready to be executed.
The number of processes are limited, so you won’t have an explosion. You can test the system and have a good idea on the limit, when your throughput is not good enough to execute the actions within a reasonable delay, so you can stop the users from starting a new match. This is a better fallback option than allowing everyone to start one and then not giving a good experience.
A priority queue can be put in place, in that case, to inform the user: “You will be able to start your match in ~3 minutes“
Or you can add more processes/servers to increase the throughput in a predictable manner.
Alternative: Whole match pregeneration
Another alternative is actually generating a set of actions and returning them in the first go, and display them at the proper times in the client side. If any adjustment is required due the actions of the user, redo all the results from that time on.
For example, a match starts, and returns the 20 server actions to the client, which shows them to the user one each minute. In the 3rd minute, the user performs an action, which makes the server to recalculate the remainder of the match and return another 17 actions. This is a good strategy if generating actions in advance is possible and few interactions from the user are expected.
The bottom line
The main word here is stateless. It is a basic component of an scalable system, and it’s always worth it to keep in mind when designing a system to be used to more than a couple of users.