
Pandemic Control, Game Theory, and Machine Learning

Yao Xuan
Robert Balkin
Jiequn Han
Ruimeng Hu
Hector D. Ceniceros

Communicated by Notices Associate Editor Reza Malek-Madani


COVID-19 and Control Policies

The coronavirus disease 2019 (COVID-19) pandemic has had an enormous impact on our lives. Based on data from the World Health Organization, as of May 2022, there have been more than 520 million confirmed cases of infection and more than 6 million deaths globally. In the United States, there have been more than 83 million confirmed cases and more than one million deaths. Needless to say, the economic impact has also been catastrophic, resulting in unprecedented unemployment and the bankruptcy of many restaurants, recreation centers, shopping malls, etc.

Control policies play a crucial role in alleviating the COVID-19 pandemic. For example, lockdown and work-from-home policies and mask requirements on public transport and in public areas have proven effective in curbing the spread of COVID-19. On the other hand, policymakers also have to be aware of the loss of economic activity caused by these control policies. Therefore, a thorough understanding of the evolution of COVID-19 and of the decision-making it provokes will be beneficial for future events and for other interconnected systems around the world.


Epidemiology is the science of analyzing the distribution and determinants of health-related states and events in specified populations, and the application of this study to the control of health problems. Infectious diseases, including the novel coronavirus disease (COVID-19), are one such problem.

Since March 2020, when the World Health Organization declared the COVID-19 outbreak a global pandemic, epidemiologists have made tremendous efforts to understand how COVID-19 infections emerge and spread and how they may be prevented and controlled. Many epidemiological methods involve mathematical tools, e.g., using causal inference to identify causative agents and factors for its propagation, and molecular methods to simulate disease transmission dynamics.

The first mathematical model of epidemic spreading dates back to 1760 and is due to Daniel Bernoulli [Ber60]. Since then, many papers have been dedicated to this field and, later on, to epidemic control. Among control strategies, quarantine, first introduced in 1377 in Dubrovnik on Croatia’s Dalmatian Coast [GB97], has proven a powerful component of the public health response to emerging and reemerging infectious diseases. However, quarantine and other measures for controlling epidemic diseases have always been controversial because of the political, ethical, and socioeconomic issues they can raise. Such complications naturally call for the inclusion of decision-making in epidemic control, as it helps to answer how to take optimal actions that balance public interest and individual rights. Only in recent years, however, has research been conducted in this direction. Moreover, when multiple authorities are involved in the decision-making process, analyzing how they make decisions collectively or competitively is challenging because the resulting problem is high-dimensional and difficult to solve.

In this article, we focus on decision-making for COVID-19 interventions. We aim to provide mathematical models and efficient numerical methods, to offer justifications for related policies that have been implemented in the past, and to explain from a game-theoretic viewpoint how the authorities’ decisions affect their neighboring regions.

Mathematical models

In a classic compartmental epidemiological model, each individual in a geographical region is assigned a label, e.g., Susceptible, Exposed, Infectious, Removed, or Vaccinated. Different labels represent different statuses. S: those who are not yet infected; E: those who have been infected but are not yet infectious themselves; I: those who have been infected and are capable of spreading the disease to those in the susceptible category; R: those who have been infected and then removed from the disease due to recovery or death; and V: those who have been vaccinated and are immune to the infection. As COVID-19 progressed, it was learned that spread from asymptomatic cases was an important driving force. More refined models may further split I into mildly symptomatic or asymptomatic individuals who recover at home and seriously symptomatic ones who need hospitalization. We point to [AZM20], which considers a similar problem in the optimal control setting and includes asymptomatic individuals and the effect of impulses.

Individuals transit between these compartments, and the order of the labels in a model indicates the flow pattern between the compartments. For instance, in a simple SEIR model [LHL87] (see also Figure 1a), a susceptible individual becomes exposed after close contact with infected individuals; exposed individuals become infectious after a latency period; and infected individuals become removed afterward due to recovery or death. Let $S(t)$, $E(t)$, $I(t)$, and $R(t)$ be the proportions of the population in each compartment at time $t$. The following differential equations provide the mathematical model:

$$
\begin{aligned}
\frac{dS(t)}{dt} &= -\beta S(t) I(t), \\
\frac{dE(t)}{dt} &= \beta S(t) I(t) - \gamma E(t), \\
\frac{dI(t)}{dt} &= \gamma E(t) - \lambda I(t), \\
\frac{dR(t)}{dt} &= \lambda I(t),
\end{aligned}
\tag{1}
$$

where $\beta$ is the average number of contacts per person per unit time, $\gamma$ describes the latent period during which a person has been infected but is not yet infectious (it is the inverse of the average latent time), and $\lambda$ represents the recovery rate, measuring the proportion of the infected population that recovers or dies per unit time.
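
To make the dynamics concrete, the following is a minimal sketch that integrates an SEIR system of this form numerically. The function name `simulate_seir`, the Runge-Kutta time stepping, and all parameter values are illustrative assumptions, not part of the article; `beta`, `gamma`, and `lam` stand for the contact rate, the inverse latent period, and the recovery rate.

```python
def simulate_seir(beta, gamma, lam, y0, t_end, dt=0.01):
    """Integrate the SEIR system with the classical fourth-order Runge-Kutta scheme.

    y0 = (S, E, I, R) holds initial population proportions that sum to one.
    Returns the list of states visited at times 0, dt, 2*dt, ...
    """

    def rhs(y):
        s, e, i, _ = y
        new_exposed = beta * s * i  # flow from S to E
        return (-new_exposed, new_exposed - gamma * e, gamma * e - lam * i, lam * i)

    def shifted(y, k, h):
        return tuple(a + h * b for a, b in zip(y, k))

    y = tuple(y0)
    path = [y]
    for _ in range(int(round(t_end / dt))):
        k1 = rhs(y)
        k2 = rhs(shifted(y, k1, 0.5 * dt))
        k3 = rhs(shifted(y, k2, 0.5 * dt))
        k4 = rhs(shifted(y, k3, dt))
        y = tuple(
            a + dt / 6.0 * (p + 2.0 * q + 2.0 * u + v)
            for a, p, q, u, v in zip(y, k1, k2, k3, k4)
        )
        path.append(y)
    return path


# Hypothetical parameter values, chosen only for illustration.
path = simulate_seir(beta=0.5, gamma=0.2, lam=0.1, y0=(0.99, 0.01, 0.0, 0.0), t_end=200.0)
peak_infectious = max(state[2] for state in path)
print("peak infectious fraction:", round(peak_infectious, 3))
print("final removed fraction:", round(path[-1][3], 3))
```

Because the four right-hand sides sum to zero, the total population proportion stays equal to one along the computed path, which is a quick sanity check for any implementation.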

Figure 1.

(a) A simple SEIR model: susceptible individuals become exposed after close contact with infected ones; those exposed become infectious after a latency period; and those infected become removed afterward due to recovery or death; (b) Controlled SEIR model: the planner chooses the level of nonpharmaceutical policies $\ell$ (lockdown or work from home) and pharmaceutical policies $h$ (effort of vaccination development or distribution), affecting the transitions such that only a fraction $1 - \theta \ell$ of the original susceptible and infectious individuals can contact each other, and affecting the recovery rate $\lambda(h)$ from infectious individuals to removed ones; here $\theta$ is used to describe the effectiveness of policy $\ell$; (c) An illustration of the game-theoretic SEIR model for two regions.


Many infections, such as measles and chickenpox, confer long-term, if not lifelong, immunity, while others, such as influenza, do not. As evidenced by numerous epidemiological and clinical studies analyzing possible factors for COVID-19 reinfections, COVID-19 falls precisely into the second category [NBN22]. Mathematically, this can be taken into account by adding a transition from $R$ to $S$.

Though deterministic models such as (1) have received more attention in the literature, mainly due to their tractability, stochastic models have some advantages. The epidemic-spreading process is by nature stochastic. Moreover, introducing stochasticity to the system can account for numerical and empirical uncertainties, and also provide probabilistic predictions, i.e., a range of possible scenarios associated with their likelihoods. This is crucial for understanding the uncertainties in the estimates.

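One class of stochastic epidemic models uses continuous-time Markov chains, where the state process takes discrete values but evolves in continuous time and is Markovian. In a simple stochastic SIS (susceptible-infectious-susceptible) model [KL89] with a population of $N$ individuals, let $X_t$ be the number of infected individuals at time $t$, $\beta$ the rate at which infected individuals infect susceptible ones, and $\gamma$ the rate at which an infected individual recovers and becomes susceptible again. The transition probabilities among the states $n$, $n+1$, and $n-1$ over a short time interval $\Delta t$ are

$$
\begin{aligned}
\mathbb{P}(X_{t+\Delta t} = n+1 \mid X_t = n) &= \frac{\beta n (N - n)}{N} \Delta t + o(\Delta t), \\
\mathbb{P}(X_{t+\Delta t} = n-1 \mid X_t = n) &= \gamma n \Delta t + o(\Delta t), \\
\mathbb{P}(X_{t+\Delta t} = n \mid X_t = n) &= 1 - \Big( \frac{\beta n (N - n)}{N} + \gamma n \Big) \Delta t + o(\Delta t).
\end{aligned}
$$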
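
A continuous-time Markov chain of this kind can be sampled exactly with the Gillespie algorithm, which draws exponential waiting times between jumps. The sketch below assumes the standard SIS rates (infection at rate `beta * x * (n_pop - x) / n_pop`, recovery at rate `gamma * x`); the function name and the parameter values are hypothetical.

```python
import random


def gillespie_sis(n_pop, x0, beta, gamma, t_end, rng):
    """Sample one path of the number of infected individuals in a stochastic SIS model."""
    t, x = 0.0, x0
    times, states = [t], [x]
    while x > 0:  # x = 0 is absorbing: the infection has died out
        rate_up = beta * x * (n_pop - x) / n_pop  # one more infected individual
        rate_down = gamma * x  # one infected individual recovers
        total = rate_up + rate_down
        t += rng.expovariate(total)  # exponential waiting time until the next jump
        if t > t_end:
            break
        x += 1 if rng.random() < rate_up / total else -1
        times.append(t)
        states.append(x)
    return times, states


# Hypothetical values: 200 individuals, 5 initially infected.
times, states = gillespie_sis(
    n_pop=200, x0=5, beta=0.5, gamma=0.1, t_end=50.0, rng=random.Random(1)
)
print("number of jumps:", len(states) - 1)
print("infected at the end of the path:", states[-1])
```

Repeating the simulation with different seeds gives an empirical distribution of outcomes, which is the kind of probabilistic prediction that deterministic models cannot provide.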

Another way to construct a stochastic model is by introducing white noise in (1) [TBV05, All08], which is the approach we mainly consider in this article and describe in detail in a later section.

Control of disease spread

After modeling how diseases are transmitted through a population, epidemiologists then design corresponding control measures and recommend health-related policies to the region planner.

In general, there are two types of interventions: pharmaceutical interventions (PIs), such as getting vaccinated and taking medicines, and nonpharmaceutical interventions (NPIs), such as requiring mandatory social distancing, quarantining infected individuals, and deploying protective resources. For the ongoing COVID-19, intervention policies that have been implemented include, but are not limited to, issuing lockdown or work-from-home policies, developing vaccines, and later expanding equitable vaccine distribution, providing telehealth programs, deploying protective resources and distributing free testing kits, educating the public on how the virus transmits, and focusing on surface disinfection.

Mathematically, this can be formulated as a control problem: the planner chooses the level of each policy affecting the transitions in (1) such that the region’s overall cost is minimized. Generally, NPIs help mitigate the spread by lowering the infection rate $\beta$. For example, a lockdown or work-from-home policy $\ell(t)$ implemented at time $t$ modifies the transition $\beta S(t) I(t)$ to

$$
\beta \, S(t) \big(1 - \theta \ell(t)\big) \, I(t) \big(1 - \theta \ell(t)\big),
$$

meaning that only a fraction $1 - \theta \ell(t)$ of the original susceptible and infectious individuals can contact each other, where $\theta$ describes the effectiveness of $\ell$ [AAL20] (see Figure 1b). PIs such as taking preventive medicines, if available, will also lower the infection rate $\beta$, while using antidotes will increase the recovery rate $\lambda$. The modeling of vaccinations is more complex. Depending on the target disease, vaccination may reduce $\beta$ (less chance of being infected) or increase $\lambda$ (faster recovery). It may even create a new compartment “Vaccinated” in which individuals cannot be infected and which is an absorbing state if lifelong immunity is gained.
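
As a rough illustration of how an NPI enters the dynamics, the sketch below scales the infection term of an SEIR model by the squared factor `(1 - theta * lockdown(t))` and compares the peak infectious fraction with and without a constant lockdown. The forward Euler scheme, the function name `peak_infectious_fraction`, and all numerical values are assumptions made for this example.

```python
def peak_infectious_fraction(lockdown, beta=0.5, gamma=0.2, lam=0.1,
                             theta=0.8, t_end=300.0, dt=0.01):
    """Forward Euler integration of a controlled SEIR model; returns the peak of I.

    lockdown(t) is the fraction of the population under lockdown at time t, and
    theta is the effectiveness of the policy.
    """
    s, e, i = 0.99, 0.01, 0.0
    peak = i
    for step in range(int(round(t_end / dt))):
        contact = (1.0 - theta * lockdown(step * dt)) ** 2  # reduction of S-I contacts
        new_exposed = beta * contact * s * i
        # The right-hand side is evaluated with the old state before assignment.
        s, e, i = (
            s - dt * new_exposed,
            e + dt * (new_exposed - gamma * e),
            i + dt * (gamma * e - lam * i),
        )
        peak = max(peak, i)
    return peak


peak_free = peak_infectious_fraction(lambda t: 0.0)
peak_locked = peak_infectious_fraction(lambda t: 0.5)
print("peak without lockdown:", round(peak_free, 3))
print("peak with half the population locked down:", round(peak_locked, 3))
```

With `theta = 0.8` and half of the population under lockdown, the contact factor is `(1 - 0.4) ** 2 = 0.36`, so the effective infection rate drops to roughly a third of its uncontrolled value.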

A region planner, taking into account the interventions’ effects on the dynamics (1), decides on a policy by weighing different costs. These costs may include the economic loss due to the decrease in productivity during a lockdown, the economic value of life lost to the death of infected individuals, and other social-welfare costs due to the aforementioned measures.

Game-theoretic SEIR Model

Game theory studies the strategic interactions among rational players and has applications in all fields of social science, computer science, financial mathematics, and epidemiology. A game is noncooperative if players cannot form alliances or if all agreements need to be self-enforcing. Nash equilibrium is the most common kind of self-enforcing agreement [Nas51]: a collective strategy emerges from all players in the game, from which no one has an incentive to deviate unilaterally.

Nowadays, as the world is more interconnected than ever before, one region’s epidemic policy will inevitably influence the neighboring regions. For instance, in the US, decisions made by the governor of New York will affect the situation in New Jersey, as so many people travel daily between the two states. Imagine that both state governors make decisions representing their own benefits, take into account others’ rational decisions, and may even compete for the scarce resources (e.g., frontline workers and personal protective equipment). These are precisely the features of a noncooperative game. Computing the Nash equilibrium from such a game will provide valuable, qualitative guidance and insights for policymakers on the impact of specific policies.

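We now introduce a multi-region stochastic SEIR model [XBH22] to capture the game features in epidemic control. We give an illustration for two regions in Figure 1c. Each region’s population is divided into four compartments: Susceptible, Exposed, Infectious, and Removed. Denote by $S^n_t, E^n_t, I^n_t, R^n_t$ the proportions of the population in the four compartments of region $n$ at time $t$. They satisfy the following stochastic differential equations (SDEs), which include interventions (PIs and NPIs), stochastic factors, and game features:

$$
dS^n_t = -\sum_{k=1}^{N} \beta^{nk} S^n_t I^k_t (1 - \theta \ell^n_t)(1 - \theta \ell^k_t) \, dt - v(h^n_t) S^n_t \, dt - \sigma_{s_n} S^n_t \, dW^{s_n}_t, \tag{2}
$$

$$
dE^n_t = \sum_{k=1}^{N} \beta^{nk} S^n_t I^k_t (1 - \theta \ell^n_t)(1 - \theta \ell^k_t) \, dt - \gamma E^n_t \, dt + \sigma_{s_n} S^n_t \, dW^{s_n}_t - \sigma_{e_n} E^n_t \, dW^{e_n}_t, \tag{3}
$$

$$
dI^n_t = \big(\gamma E^n_t - \lambda(h^n_t) I^n_t\big) \, dt + \sigma_{e_n} E^n_t \, dW^{e_n}_t, \tag{4}
$$

$$
dR^n_t = \lambda(h^n_t) I^n_t \, dt + v(h^n_t) S^n_t \, dt, \tag{5}
$$

where $n \in \mathcal{N} := \{1, 2, \dots, N\}$, the collection of $N$ regions; the terms $dW$ with different superscripts indicate white noise for a compartment in a specific region, with intensities $\sigma_{s_n}$ and $\sigma_{e_n}$; and $\ell^n_t$ and $h^n_t$ are the NPIs and PIs chosen by the planner of region $n$ at time $t$. The planner of region $n$ minimizes its region’s cost within a period $[0, T]$:

$$
J^n(\boldsymbol{\ell}, \boldsymbol{h}) := \mathbb{E}\bigg[ \int_0^T e^{-rt} P^n \Big[ (S^n_t + E^n_t + I^n_t) \ell^n_t w + a \big( \kappa I^n_t \chi + p I^n_t c \big) \Big] + e^{-rt} \eta (h^n_t)^2 \, dt \bigg]. \tag{6}
$$

We explain the model (2)-(6) in detail: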


In (2), $\beta^{nk}$ denotes the average number of contacts of infected people in region $k$ with susceptible individuals in region $n$ per time unit. Although some regions may not be geographically connected, transmission between them is still possible due to air travel, but it is less intensive than transmission within a region, i.e., $\beta^{nk} > 0$ and $\beta^{nn} \gg \beta^{nk}$ for all $k \neq n$. The decision for NPIs of region $n$’s planner is given by $\ell^n_t \in [0, 1]$. In particular, it represents the fraction of the population under NPIs (such as social distancing) at time $t$. We assume that those under interventions cannot be infected. However, the policy may only be partially effective, as essential activities (food production and distribution, health, and basic services) have to continue. We use $\theta \in [0, 1]$ to measure this effectiveness. The transition rate under policy $\ell$ thus becomes $\beta^{nk} S^n_t I^k_t (1 - \theta \ell^n_t)(1 - \theta \ell^k_t)$. The case $\theta = 1$ means the policy is fully effective. One can also view $\theta$ as the level of public compliance.

The planner of region $n$ also makes the decision $h^n_t$. This represents the effort, at time $t$, that the planner puts into PIs. We refer to this term, $h^n_t$, as the health policy. It influences the vaccination availability $v(\cdot)$ and the recovery rate $\lambda(\cdot)$ of this model. Here $v(h^n_t)$ denotes the vaccination availability of region $n$ at time $t$. In this model, we assume that once vaccinated, susceptible individuals become immune to the disease and join the removed category $R$. This assumption is not fully consistent with COVID-19 but is reasonable for a short-term decision-making problem. We model $v(\cdot)$ as an increasing function of $h^n_t$, and if the vaccine has not yet been developed, we can define $v(x) = 0$ for $x \le \overline{h}$.


In (3), $\gamma$ describes the latent period when the person is infected but is not yet infectious. It is the inverse of the average latent time, and we assume $\gamma$ to be identical across all regions. The transition between $E^n$ and $I^n$ is proportional to the fraction of exposed individuals, i.e., $\gamma E^n_t$.

I and R

In (4) and (5), $\lambda(\cdot)$ represents the recovery rate. For the infected individuals, a fraction $\lambda(h^n_t)$ (including both death and recovery from the infection) joins the removed category $R$ per time unit. The rate is determined by the average duration of infection. We model the duration, and hence the recovery rate, as depending on the health policy $h^n_t$ decided by the region’s planner.

The more effort put into the region (e.g., expanding hospital capacity and creating more drive-thru testing sites), the more clinical resources the region will have and the more resources will be accessible to patients, which could accelerate recovery and reduce deaths. The death rate, denoted by $\kappa$, is crucial for computing the cost of region $n$.


In (6), each region planner faces four types of cost. The first is the economic activity loss due to the lockdown policy, where $w$ is the productivity rate per individual and $P^n$ is the population of region $n$. The second is due to the death of infected individuals. Here, $\kappa$ is the death rate, which we assume for simplicity to be constant, and $\chi$ denotes the economic cost of each death. The hyperparameter $a$ describes how planners weigh deaths and infections as compared to other costs. The third is the in-patient cost, where $p$ is the hospitalization rate and $c$ is the cost per in-patient per day. The last term, $\eta (h^n_t)^2$, quantifies the grants for health policies. We choose a quadratic form so that the effort $h^n_t$ is a concave function of the amount granted. This accounts for the law of diminishing marginal utility: the marginal utility from each additional unit declines as investment increases. All costs are discounted by an exponential function $e^{-rt}$, where $r$ is the risk-free interest rate, to take into account the time preference. Note that region $n$’s cost depends on all regions’ policies $(\boldsymbol{\ell}, \boldsymbol{h})$, since $I^k_t$, $k \neq n$, appears in the dynamics of $S^n_t$. Thus we write it as $J^n(\boldsymbol{\ell}, \boldsymbol{h})$. The above model (2)-(5) is undoubtedly a prototype, and one can generalize it by considering reinfections (adding a transition from $R$ to $S$), an asymptomatic population (adding an asymptomatic compartment), different control policies for susceptible and infectious individuals (using separate policy variables in (2)-(3)), and different fatality rates for the young and elderly populations (introducing separate death rates in (6)).
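
The following sketch shows one way to simulate a two-region model of this type with the Euler-Maruyama scheme and to accumulate each region's discounted cost along a sample path. It assumes an infection term of the form `beta[n][k] * S_n * I_k * (1 - theta * ell_n) * (1 - theta * ell_k)`, multiplicative noise on the S-to-E and E-to-I flows, constant policies, and a running cost with lockdown, death, in-patient, and quadratic health-policy terms. The functions `vacc` and `recov` and every numerical value are hypothetical placeholders, not calibrated quantities.

```python
import math
import random


def simulate_regions(ell, h, rng, t_end=180.0, dt=0.05):
    """Euler-Maruyama sketch of a two-region stochastic SEIR model.

    ell[n] and h[n] are constant NPI and PI levels of region n.
    Returns the final (S, E, I, R) lists and each region's discounted cost.
    """
    # Every numerical value and both rate functions below are hypothetical.
    beta = [[0.5, 0.05], [0.05, 0.5]]  # beta[n][k]: region-k infected meeting region-n susceptible
    gamma, theta = 0.2, 0.9
    sigma_s, sigma_e = 0.02, 0.02
    pop = [1.0e6, 2.0e6]
    w, a, kappa, chi, p, c, eta, r = 1.0, 1.0, 0.001, 100.0, 0.05, 2.0, 1.0e5, 1.0e-4

    def vacc(x):  # vaccination availability, increasing in the effort x
        return 0.002 * x

    def recov(x):  # recovery rate, increasing in the effort x
        return 0.1 * (1.0 + 0.5 * x)

    regions = range(2)
    s, e, i, rem = [0.985, 0.985], [0.01, 0.01], [0.005, 0.005], [0.0, 0.0]
    cost = [0.0, 0.0]
    sqrt_dt = math.sqrt(dt)
    for step in range(int(round(t_end / dt))):
        discount = math.exp(-r * step * dt)
        nxt_s, nxt_e, nxt_i, nxt_r = s[:], e[:], i[:], rem[:]
        for n in regions:
            infection = sum(
                beta[n][k] * s[n] * i[k] * (1.0 - theta * ell[n]) * (1.0 - theta * ell[k])
                for k in regions
            )
            noise_s = sigma_s * s[n] * rng.gauss(0.0, 1.0) * sqrt_dt
            noise_e = sigma_e * e[n] * rng.gauss(0.0, 1.0) * sqrt_dt
            vaccinated = vacc(h[n]) * s[n]
            recovered = recov(h[n]) * i[n]
            nxt_s[n] = s[n] - (infection + vaccinated) * dt - noise_s
            nxt_e[n] = e[n] + (infection - gamma * e[n]) * dt + noise_s - noise_e
            nxt_i[n] = i[n] + (gamma * e[n] - recovered) * dt + noise_e
            nxt_r[n] = rem[n] + (recovered + vaccinated) * dt
            running = pop[n] * (
                (s[n] + e[n] + i[n]) * ell[n] * w + a * (kappa * i[n] * chi + p * i[n] * c)
            ) + eta * h[n] ** 2
            cost[n] += discount * running * dt
        s, e, i, rem = nxt_s, nxt_e, nxt_i, nxt_r
    return (s, e, i, rem), cost


(final_s, final_e, final_i, final_r), costs = simulate_regions(
    ell=[0.3, 0.1], h=[0.5, 0.5], rng=random.Random(0)
)
print("final removed fractions:", [round(x, 3) for x in final_r])
print("discounted costs along this path:", [round(x) for x in costs])
```

A single path only gives one realization of the cost; the expectation in the objective would be estimated by averaging over many independent paths. Changing `ell[1]` while keeping `ell[0]` fixed changes region 0's cost as well, which is the coupling that turns the problem into a game.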

Nash equilibria and the HJB system

As explained above, the interaction between region planners can be viewed as a noncooperative game, where Nash equilibrium is the notion of optimality.

Definition 1.

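A Nash equilibrium (NE) is a tuple $\boldsymbol{\alpha}^* = (\alpha^{1,*}, \dots, \alpha^{N,*}) \in \mathbb{A}^N$ such that for all $n \in \mathcal{N}$ and $\alpha^n \in \mathbb{A}$,

$$
J^n(\boldsymbol{\alpha}^*) \le J^n(\boldsymbol{\alpha}^{-n,*}, \alpha^n),
$$

where $\boldsymbol{\alpha}^{-n,*}$ represents the strategies of players other than the $n$-th one:

$$
\boldsymbol{\alpha}^{-n,*} := [\alpha^{1,*}, \dots, \alpha^{n-1,*}, \alpha^{n+1,*}, \dots, \alpha^{N,*}].
$$

Here $\mathbb{A}$ denotes the set of admissible strategies for each player and $\mathbb{A}^N$ is the product of $N$ copies of $\mathbb{A}$. For simplicity, we have assumed that all players take actions in the same space.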
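
The defining inequality can be checked directly in a small finite game. The sketch below enumerates the pure-strategy Nash equilibria of a two-region game in which each planner chooses between no lockdown (action 0) and lockdown (action 1). The cost tables are invented for illustration and are not derived from the SEIR model.

```python
from itertools import product


def pure_nash_equilibria(costs):
    """Return all pure-strategy Nash equilibria of a two-player cost game.

    costs[n][a1][a2] is the cost paid by player n under the action profile (a1, a2).
    """
    n_actions = len(costs[0])
    equilibria = []
    for a1, a2 in product(range(n_actions), repeat=2):
        # Player 1 cannot lower its cost by deviating while player 2 keeps a2.
        p1_ok = all(costs[0][a1][a2] <= costs[0][d][a2] for d in range(n_actions))
        # Player 2 cannot lower its cost by deviating while player 1 keeps a1.
        p2_ok = all(costs[1][a1][a2] <= costs[1][a1][d] for d in range(n_actions))
        if p1_ok and p2_ok:
            equilibria.append((a1, a2))
    return equilibria


# Hypothetical costs: locking down is expensive for the region that does it,
# but it lowers the neighbor's cost.
cost_region_1 = [[4.0, 2.0], [5.0, 3.0]]
cost_region_2 = [[4.0, 5.0], [2.0, 3.0]]
equilibria = pure_nash_equilibria([cost_region_1, cost_region_2])
print("pure Nash equilibria:", equilibria)
```

In this toy example the only equilibrium is the profile in which neither region locks down, even though both regions would pay less if both did. This illustrates why the equilibrium of a noncooperative game need not coincide with the jointly optimal policy.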

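Under proper conditions, the NE is obtained by solving $N$-coupled Hamilton–Jacobi–Bellman (HJB) equations via dynamic programming [CD18, Section 2.1.4]. To simplify the notation, we concatenate the states into a vector form $\boldsymbol{X}_t = [S^1_t, \dots, S^N_t, E^1_t, \dots, E^N_t, I^1_t, \dots, I^N_t]^{\mathsf{T}} \in \mathbb{R}^{3N}$, and denote its dynamics by

$$
d\boldsymbol{X}_t = b(t, \boldsymbol{X}_t, \boldsymbol{\alpha}(t, \boldsymbol{X}_t)) \, dt + \Sigma(\boldsymbol{X}_t) \, d\boldsymbol{W}_t.
$$

For the sake of simplicity, we omit the actual definitions of $b$, $\Sigma$, and $\boldsymbol{W}$ and refer to [XBH22] for further details. Let $V^n(t, \boldsymbol{x})$ be the minimized cost defined in (6) if the system starts at $\boldsymbol{X}_t = \boldsymbol{x}$. Then $V^n$, $n \in \mathcal{N}$, solves

$$
\partial_t V^n + \inf_{\alpha^n} H^n\big(t, \boldsymbol{x}, (\alpha^n, \boldsymbol{\alpha}^{-n,*})(t, \boldsymbol{x}), \nabla_{\boldsymbol{x}} V^n\big) + \frac{1}{2} \operatorname{Tr}\big( \Sigma(\boldsymbol{x})^{\mathsf{T}} \operatorname{Hess}_{\boldsymbol{x}} V^n \, \Sigma(\boldsymbol{x}) \big) = 0, \tag{7}
$$

with $V^n(T, \boldsymbol{x}) = 0$, where $H^n$ is the usual Hamiltonian defined by

$$
H^n(t, \boldsymbol{x}, \boldsymbol{\alpha}, \boldsymbol{p}) = b(t, \boldsymbol{x}, \boldsymbol{\alpha}) \cdot \boldsymbol{p} + f^n(t, \boldsymbol{x}, \alpha^n),
$$

and $f^n$ denotes the running cost of region $n$, i.e., the integrand in (6).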
Enhanced Deep Fictitious Play

Solving for the NE of the game is equivalent to solving the $N$-coupled HJB equations of dimension $(3N+1)$ defined in Equation (7). Due to the high dimensionality, this is a formidable numerical challenge. We overcome this through a deep learning methodology we call Enhanced Deep Fictitious Play, which is broadly motivated by the method of fictitious play introduced by Brown [Bro51].

Deep Learning. Deep learning leverages a class of computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction [LBH15]. Deep neural networks are effective tools for approximating unknown functions in high-dimensional space. In recent years, we have witnessed notable success in the marriage of deep learning and computational mathematics to solve high-dimensional differential equations. Specifically, deep neural networks show strong capability in solving stochastic control problems and games [HJE18, HL22]. Below, we use a simple example to illustrate how a deep neural network is determined for function approximation.

Suppose we would like to approximate a map $y = f(x)$ by a neural network $\mathrm{NN}(x; \vartheta)$, for which one seeks appropriate network parameters $\vartheta$ through a process called training. This consists of minimizing a loss function that measures the discrepancies between the approximation and the true values over the so-called training set $\{(x_i, y_i)\}_{i=1}^{K}$. Such a loss function has the general form

$$
L(\vartheta) = \sum_{i=1}^{K} d\big( \mathrm{NN}(x_i; \vartheta), y_i \big) + \mu \, \mathcal{R}(\vartheta),
$$

where $\mathcal{R}(\vartheta)$ is a regularization term on the parameters. The first term ensures that the predictions of $\mathrm{NN}$ approximately match the true values on the training set. Here, $d$ could be a direct distance like the $L^2$ norm or error terms derived from some complex simulations associated with $\mathrm{NN}(x_i; \vartheta)$ and $y_i$. The hyperparameter $\mu$ characterizes the relative importance of the two terms in $L(\vartheta)$. To find an optimal set of parameters $\vartheta^*$, one solves the problem of minimizing $L(\vartheta)$ by the stochastic gradient descent (SGD) method [BCN18]. Regarding the architecture of $\mathrm{NN}$, there is a wide variety of choices depending on the problem, for example fully connected neural networks, convolutional neural networks, recurrent neural networks, and transformers. In this work, we chose fully connected neural networks to approximate the solution and constructed the loss function by simulating the backward stochastic differential equations corresponding to the HJB equations.
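
As a toy instance of this training procedure, the sketch below fits a one-hidden-layer fully connected network to the map $x \mapsto x^2$ with minibatch SGD, using a mean squared error plus a small quadratic penalty on the weights. It is plain supervised regression written with the Python standard library, not the BSDE-based loss used in this work; the network size, learning rate, and regularization weight `mu` are arbitrary choices.

```python
import math
import random


def init_params(hidden, rng):
    return {
        "w1": [rng.uniform(-1.0, 1.0) for _ in range(hidden)],
        "b1": [0.0] * hidden,
        "w2": [rng.uniform(-1.0, 1.0) for _ in range(hidden)],
        "b2": 0.0,
    }


def predict(params, x):
    hidden_out = (math.tanh(w * x + b) for w, b in zip(params["w1"], params["b1"]))
    return sum(w2 * a for w2, a in zip(params["w2"], hidden_out)) + params["b2"]


def loss(params, data, mu):
    fit = sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)
    reg = sum(w * w for w in params["w1"] + params["w2"])
    return fit + mu * reg


def sgd_step(params, batch, mu, lr):
    """One SGD update using the gradient of the regularized loss on a minibatch."""
    hidden = len(params["w1"])
    g_w1, g_b1, g_w2, g_b2 = [0.0] * hidden, [0.0] * hidden, [0.0] * hidden, 0.0
    for x, y in batch:
        acts = [math.tanh(w * x + b) for w, b in zip(params["w1"], params["b1"])]
        pred = sum(w2 * a for w2, a in zip(params["w2"], acts)) + params["b2"]
        err = 2.0 * (pred - y) / len(batch)
        for j in range(hidden):
            back = err * params["w2"][j] * (1.0 - acts[j] ** 2)  # chain rule through tanh
            g_w2[j] += err * acts[j]
            g_w1[j] += back * x
            g_b1[j] += back
        g_b2 += err
    for j in range(hidden):
        params["w1"][j] -= lr * (g_w1[j] + 2.0 * mu * params["w1"][j])
        params["b1"][j] -= lr * g_b1[j]
        params["w2"][j] -= lr * (g_w2[j] + 2.0 * mu * params["w2"][j])
    params["b2"] -= lr * g_b2


rng = random.Random(0)
data = [(-1.0 + 0.05 * k, (-1.0 + 0.05 * k) ** 2) for k in range(41)]  # training set
params = init_params(hidden=8, rng=rng)
mu, lr = 1.0e-4, 0.05
initial_loss = loss(params, data, mu)
for _ in range(3000):
    sgd_step(params, rng.sample(data, 8), mu, lr)
final_loss = loss(params, data, mu)
print("loss before training:", round(initial_loss, 4))
print("loss after training:", round(final_loss, 4))
```

In practice one would use an automatic differentiation framework rather than hand-written gradients; the manual version is shown only to make each ingredient of the loss and of the SGD update explicit.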

Note that the HJB system (7) is difficult to solve due to the high dimensionality of the $N$-coupled equations. What if we could decouple the system into $N$ separate equations, each of which is easier to solve? This is the central idea of fictitious play, where we update our approximations to the optimal policies of each player iteratively, stage by stage. In each stage, instead of updating the approximations of all the players together by solving the giant system, we do it separately and in parallel. Each player solves for her own optimal policy assuming that the other players are taking their approximated optimal strategies from the last stage. Let us denote the optimal policy and corresponding value function of the single player $n$ in stage $m$ as $\alpha^{n,m}$ and $V^{n,m}$, respectively, and the collection of these two quantities for all the players as $\boldsymbol{\alpha}^m$ and $\boldsymbol{V}^m$. Finally, let us denote the optimal policies and corresponding value functions for all the players except for player $n$ as $\boldsymbol{\alpha}^{-n,m}$ and $\boldsymbol{V}^{-n,m}$, where $\boldsymbol{\alpha}$ is a concatenation of lockdown policies and vaccination policies, $\boldsymbol{\alpha} = (\boldsymbol{\ell}, \boldsymbol{h})$. At stage $m+1$, we can solve for the optimal policy and value function of player $n$ given that the other players are taking the known policies $\boldsymbol{\alpha}^{-n,m}$ with the corresponding values $\boldsymbol{V}^{-n,m}$. The logic of fictitious play is shown in Figure 2, where players iteratively decide optimal policies in stage $m+1$, based on other players’ optimal policies in stage $m$. This is slightly different from the usual simultaneous fictitious play, where the belief is described by the time average of past play; the distinction is further discussed in [HH20].
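
The stage-by-stage logic can be illustrated on a static toy game in which each best response is available in closed form. In the sketch below, player $n$ pays the hypothetical cost $(a^n - 1 - c \sum_{k \neq n} a^k)^2$ with coupling $c = 0.25$, so its best response to the others is $1 + c \sum_{k \neq n} a^k$. This replaces the PDE solve of each stage by a one-line formula and is only meant to show the decoupling, not the algorithm used for the SEIR game.

```python
def best_response(n, actions, coupling=0.25):
    """Minimizer over a_n of (a_n - 1 - coupling * sum of the other players' actions)^2."""
    return 1.0 + coupling * sum(a for k, a in enumerate(actions) if k != n)


def fictitious_play(initial_actions, stages):
    """Each player best-responds to the other players' actions from the previous stage."""
    actions = list(initial_actions)
    history = [tuple(actions)]
    for _ in range(stages):
        # The updates only read the previous stage, so they are mutually
        # independent and could be computed in parallel.
        actions = [best_response(n, actions) for n in range(len(actions))]
        history.append(tuple(actions))
    return history


history = fictitious_play([0.0, 0.0, 0.0], stages=60)
print("actions after 5 stages:", history[5])
print("actions after 60 stages:", history[-1])
```

For three players the symmetric fixed point solves $a = 1 + 0.5 a$, i.e., $a = 2$, which is the Nash equilibrium of this toy game; the iteration contracts toward it because the coupling is weak. Convergence of fictitious play is not guaranteed in general games.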

Figure 2.

Schematic plot of fictitious play: each player derives optimal policies at stage $m+1$ assuming other players take optimal strategies at stage $m$.


The Enhanced Deep Fictitious Play (Enhanced DFP) algorithm we have designed, built on the Deep Fictitious Play (DFP) algorithm [HH20], reduces the time cost from $O(M^2 N)$ to $O(MN)$ and the memory cost from $O(MN)$ to $O(N)$, with $M$ the total number of fictitious play iterations.

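We illustrate one stage of enhanced deep fictitious play in Figure 3. At the $(m+1)$-th stage, given the optimal policies $\boldsymbol{\alpha}^m$ at the previous stage, for $n \in \mathcal{N}$, the algorithm solves the following partial differential equations (PDEs),

$$
\partial_t V^{n,m+1} + \inf_{\alpha^n} H^n\big(t, \boldsymbol{x}, (\alpha^n, \boldsymbol{\alpha}^{-n,m})(t, \boldsymbol{x}), \nabla_{\boldsymbol{x}} V^{n,m+1}\big) + \frac{1}{2} \operatorname{Tr}\big( \Sigma(\boldsymbol{x})^{\mathsf{T}} \operatorname{Hess}_{\boldsymbol{x}} V^{n,m+1} \, \Sigma(\boldsymbol{x}) \big) = 0, \tag{8}
$$

with $V^{n,m+1}(T, \boldsymbol{x}) = 0$, and obtains the optimal strategy of the $(m+1)$-th stage:

$$
\alpha^{n,m+1}(t, \boldsymbol{x}) = \operatorname*{arg\,min}_{\alpha^n} H^n\big(t, \boldsymbol{x}, (\alpha^n, \boldsymbol{\alpha}^{-n,m})(t, \boldsymbol{x}), \nabla_{\boldsymbol{x}} V^{n,m+1}(t, \boldsymbol{x})\big).
$$

To simplify the notation, we omit the stage number in the superscript in the following discussion. The solution to Equation (8) is approximated by solving the equivalent backward stochastic differential equations (BSDEs) using neural networks [HJE18]: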