There are several vignettes explaining simulation setup, for basic lifecycle simulations here and here, as well as for human/mosquito epidemiological studies, here and here for SEI/SIS models and here for SEI/SEIR models. However, for actual explorations, many stochastic realizations are needed, as well as summary statistics of those simulations. MGDrivE2 provides tools for storing output, as well as analyzing and summarizing that output. This vignette will use the network SEI/SIS simulation as the example simulation to showcase data storage and analysis.
We start by loading the MGDrivE2 package, as well as the MGDrivE package for access to inheritance cubes, and Matrix for sparse matrices used in migration. We will use the basic cube to simulate Mendelian inheritance for this example.
These are the same parameters found in the epi-network vignette, and are also explained in the lifecycle-node vignette.
We will store output every other day (dt = 2
), as
writing to disk can be slow and this reduces the amount of data to store
by half. The rate that one needs to store data will depend on the
accuracy of one’s parameters - i.e., if your parameters are highly
questionable, storing data every other day will not lose much
information, but if your parameters are very well defined, daily storage
may be necessary for accuracy. Note we set tmax
artifically
low in this vignette to reduce build times as the goal of this article
is to demonstrate CSV processing capabilities of
MGDrivE2.
# entomological and epidemiological parameters
theta <- list(
# lifecycle parameters
qE = 1/4,
nE = 2,
qL = 1/3,
nL = 3,
qP = 1/6,
nP = 2,
muE = 0.05,
muL = 0.15,
muP = 0.05,
muF = 0.09,
muM = 0.09,
beta = 16,
nu = 1/(4/24),
# epidemiological parameters
NH = 1000,
X = 0.25,
f = 1/3,
Q = 0.9,
b = 0.55,
c = 0.15,
r = 1/200,
muH = 1/(62*365),
qEIP = 1/11,
nEIP = 2
)
# simulation parameters
tmax <- 50
dt <- 2
Additionally, we augment the cube with transmission efficiencies; see Smith & McKenzie (2004) for an extensive discussion of these parameters.
With the basic parameters finished, we setup the types of nodes in our network, their movement in relation to each other, and finally the structure of the Petri Net.
# nodetypes
node_list <- c("m", "b", "h")
num_nodes <- length(node_list)
# human movement
h_move <- matrix(data = FALSE, nrow = num_nodes, ncol = num_nodes,
dimnames = list(node_list, node_list))
h_move[2,3] <- TRUE
h_move[3,2] <- TRUE
# mosquito movement
m_move <- matrix(data = FALSE, nrow = num_nodes, ncol = num_nodes,
dimnames = list(node_list, node_list))
m_move[1,2] <- TRUE
m_move[2,1] <- TRUE
# Places and transitions
SPN_P <- spn_P_epiSIS_network(node_list = node_list, params = theta, cube = cube)
SPN_T <- spn_T_epiSIS_network(node_list = node_list, spn_P = SPN_P, params = theta,
cube = cube, h_move = h_move, m_move = m_move)
# Stoichiometry matrix
S <- spn_S(spn_P = SPN_P, spn_T = SPN_T)
The spn_P_*
functions
(spn_P_epiSIS_network()
used here) all return length 2
lists.The first index (ix
) is a list equal to the number of
nodes (so here, length(ix) =
3), which returns the uniquely
numbered “places” for every life/infection/Erlang stage for the entire
network. These correspond to every location in the state vector used
during the simulation.
The second item in those lists (u
) is the name for every
“place”. It is a combination of genotype and node number for eggs,
larvae, pupae, unmated females, and male stages. For females, it is a
combination of their genotype, their mate genotype, their infection
status, and the node number. For humans, it is just infection status and
node number.
The spn_T_*
functions
(spn_T_epiSIS_network()
used here) returns a length 2 list.
The first element (T
) is a list of parameters for every
possible state transition. These are not the rates of transition, just
the fact that it is possible and parameters necessary to describe them.
The second element (v
) is the name for each of those
transitions.
Remember, these are node-by-node equilibria, not an equilibrium over the entire network. We expect some burn-in at the beginning of the simulation as the network reaches equilibrium.
# SEI mosquitoes and SIS humans equilibrium
# outputs required parameters in the named list "params"
# outputs intial equilibrium for adv users, "init
# outputs properly filled initial markings, "M0"
initialCons <- equilibrium_SEI_SIS(params = theta, node_list = node_list,
NF = 500, phi = 0.5, NH = 1000, pop_ratio_H = 0.15,
log_dd = TRUE, spn_P = SPN_P, cube = cube)
The equilibrium_*
functions
(equilibrium_SEI_SIS()
used here) return a length 3 list.
The first element (params
) are all of the parameters
required for the simulation. These include user-supplied parameters from
before (theta
), as well as derived parameters, such as the
density-dependent parameter. The second element is a matrix of
equilibrium populations. The rows are nodes, the columns are life
stages. Any stage that does not exist in a node (human stages in
mosquito-only nodes, mosquito stages in human-only nodes) have
NA
values. The final element is M0
, the
initial marking for the Petri Net. This takes all of the equilibrium
values from the second element, and using the places indices,
distributes populations to their appropriate location in the Petri Net
framework.
Next, we setup the movement rates for mosquitoes and humans, to their respective nodes. See the epi-network vignette for a detailed explanation.
# calculate movement rates and movement probabilities
gam <- calc_move_rate(mu = initialCons$params$muF, P = 0.05)
# set mosquito movement rates/probabilities
# mosquitoes exist in nodes 1 and 2, not 3
mr_mosy <- c(gam, gam, NaN)
mp_mosy <- Matrix::sparseMatrix(i = c(1,2), j = c(2,1), x = 1, dims = dim(m_move))
# set human movement rates/probabilities
# humans exist in nodes 2 and 3, not 1
mr_human <- c(NaN, 1/7, 1/7)
mp_human <- Matrix::sparseMatrix(i = c(2,3), j = c(3,2), x = 1, dims = dim(h_move))
# put rates and probs into the parameter list
initialCons$params$mosquito_move_rates <- mr_mosy
initialCons$params$mosquito_move_probs <- mp_mosy
initialCons$params$human_move_rates <- mr_human
initialCons$params$human_move_probs <- mp_human
Now that all the necessary parameters have been added to the named
list initialCons$params
, we generate the hazard functions,
using the function spn_hazards()
. By specifying
log_dd = TRUE
, we use logistic density dependence for these
simulations.
# exact hazards for integer-valued state space
exact_hazards <- spn_hazards(spn_P = SPN_P, spn_T = SPN_T, cube = cube,
params = initialCons$params, type = "SIS",
log_dd = TRUE, exact = TRUE, tol = NaN,
verbose = FALSE)
The hazard function (spn_hazards()
) returns a list of
functions that define the rates of every transition. It is the same
length as the first element of the SPN_T
, i.e., one rate
for every possible transition in the Petri Net.
Before running any simulations, we need to create some releases, and then setup the folder structure.
We will release 50 adult females with homozygous recessive alleles 2
times, every 7 days, starting at day 10, in node 1 (this allows us to
see movement from node 1 to node 2, before we see an impact on human
disease transmission). Remember, it is critically important that
the event names match a place name in the simulation.
The simulation function checks this and will throw an error if the event
name does not exist as a place in the simulation. This format is used in
MGDrivE2 for consistency with solvers in
deSolve
.
# releases
r_times <- seq(from = 10, length.out = 2, by = 7)
r_size <- 50
events <- data.frame("var" = paste0("F_", cube$releaseType, "_", cube$wildType, "_S_1"),
"time" = r_times,
"value" = r_size,
"method" = "add",
stringsAsFactors = FALSE)
Next, we need to setup the file structure for CSV simulations.
We recommend setting up a new folder for any
simulations. This ensures a clean location, and in the chance someone
deletes everything in the folder, there is nothing important there.
Inside the main folder (main_out
), we need three folders,
for each stage of the analysis. We generally call these
raw
, traces
, and analyzed
, to
describe their purpose. The raw
folder holds all of the raw
output from the simulation. It is further divided into repetition
folders. These can be named anything, we use a numeric naming scheme
that sorts properly in file systems. The traces
folder
holds the first set analysis output. It starts empty, the provided
functions setup the necessary folders, and each repetition will be split
into the appropriate life-stage and node designation. The final folder,
analyzed
, holds the summary statistics from our
simulations.
# main output folder
main_out <- tempdir()
# folders for each stage of analysis
analysis_out <- c("raw", "traces", "analyzed")
# repetitions, 3 of them
rep_out <- formatC(x = 1:3, width=3, format='d', flag='0')
# build analysis folders
analysis_folders <- file.path(main_out, analysis_out)
for(i in analysis_folders){ dir.create(path = i, recursive = TRUE) }
# build repetition folders
rep_folders <- file.path(analysis_folders[1], rep_out)
for(i in rep_folders){ dir.create(i) }
We start by running the simulations. As explained in the inhomogeneous vignette, there are two ways
to perform repetitions - using the internal repetition wrapper, or using
a parallel loop. By providing output folders to
sim_trajectory_CSV()
, we can make use of both methods
simultaneously. However, for this example, we only use the internal
repetitions wrapper.
Additionally, we need to specify what life-stages to store. For this example, we will output all of the stages. However, we recommend only saving the stages necessary for the experiment at hand. Each stage to print has a one-letter designation for the function.
One-Letter Designation | Stage Name |
---|---|
E |
Egg stage |
L |
Larval stage |
P |
Pupal stage |
M |
Adult male stage |
U |
Adult, unmated female stage |
F |
Adult, mated female stage |
H |
Adult humans |
The mated female stage (F
) will print infection states
to different files. This keeps individual files from becoming too
unwieldy.
Finally, we run three stochastic realizations of the simulation,
using the tau
sampler with Δt = 0.1, approximating 10
jumps per day.
# delta t
dt_stoch <- 0.1
# run tau-leaping simulation
sim_trajectory_CSV(x0 = initialCons$M0, t0 = 0, tt = tmax, dt = dt,
dt_stoch = dt_stoch, folders = rep_folders,
stage = c("E", "L", "P", "M", "U", "F", "H"), S = S,
hazards = exact_hazards, events = events, verbose = FALSE)
#> Warning in base_events(x0 = x0, events = events, dt = dt): event times do not correspond to a multiple of dt.
#> event times will be rounded up to the nearest time-step!
We can see all of the files returned by the simulation.
# list all files from the first repetition
list.files(path = rep_folders[1], full.names = FALSE)
#> [1] "E.csv" "FE.csv" "FI.csv" "FS.csv" "H.csv" "L.csv" "M.csv" "P.csv"
#> [9] "U.csv"
Each file begins with the one-letter signifier. Notice that the
female stage (F
) has been expanded into three stages, one
for each infection (SEI) stage. These names must stay
the same, as the following analysis only matches the outputs from
sim_trajectory_CSV()
.
To see how the files are organized, we will open two of them: the egg
stage (E.csv
) and the susceptible female stage
(FS.csv
).
# read the eggs from repetition 1
egg_stage <- read.csv(file = file.path(rep_folders[1], "E.csv"), header = TRUE)
# read susceptible females from repetition 1
fs_stage <- read.csv(file = file.path(rep_folders[1], "FS.csv"), header = TRUE)
First, we will look at the dimensions of each file. Everything is
printed at the same rate, dt
, so the length of the files
should match, but the number of columns is significantly different.
Every file begins with a “Time” column, followed by the stage designations. It is these stage designations that explain why female files are so much larger than the rest. As described earlier, the names for each place are a combination of genotypes and node for most stages, but for adult females, places are labeled by 3 designations. The column labels in the egg file are:
colnames(egg_stage)
#> [1] "Time" "E1_AA_1" "E2_AA_1" "E1_Aa_1" "E2_Aa_1" "E1_aa_1" "E2_aa_1"
#> [8] "E1_AA_2" "E2_AA_2" "E1_Aa_2" "E2_Aa_2" "E1_aa_2" "E2_aa_2"
First column is “Time”, and each subsequent column is the
Erlang-stage, genotype, and node. These labels correspond to the labels
in the SPN_P$u
list, and the labels on the state vector,
M0
.
Looking at the females output, we see how this state space gets large quickly.
colnames(fs_stage)
#> [1] "Time" "F_AA_AA_S_1" "F_AA_Aa_S_1" "F_AA_aa_S_1" "F_Aa_AA_S_1"
#> [6] "F_Aa_Aa_S_1" "F_Aa_aa_S_1" "F_aa_AA_S_1" "F_aa_Aa_S_1" "F_aa_aa_S_1"
#> [11] "F_AA_AA_S_2" "F_AA_Aa_S_2" "F_AA_aa_S_2" "F_Aa_AA_S_2" "F_Aa_Aa_S_2"
#> [16] "F_Aa_aa_S_2" "F_aa_AA_S_2" "F_aa_Aa_S_2" "F_aa_aa_S_2"
Again, the “Time” column is first, followed by female genotypes, mate genotypes, infection stage, and node location. This is why female files have been split into different infection stages.
The raw analysis is large, and difficult to work with. So, as a first
reduction, MGDrivE2 provides the function
split_aggregate_CSV()
to organize the raw output. This
function takes the raw
folder, duplicates the folder
structure in the traces
folder, and reduces the raw output
by several metrics. The aquatic stages are summarized by genotype, with
the option to summarize them by Erlang-stage if desired, and split into
patches. Adult female stages are summarized by female genotype, summing
over male mates, and then split by patch as well. This function expects
file names output from the previous function, and by default looks for
all of the output. If anything is not found, it is skipped.
Additionally, the user can specify subsets of the raw
output and run different analyses over each.
We designate the folder traces
because we can plot
population traces from the output. These are the individual dynamics of
each genotype and each stage for every simulation.
# split everything by patch, aggregate by genotype
split_aggregate_CSV(read_dir = analysis_folders[1], write_dir = analysis_folders[2],
spn_P = SPN_P, t0 = 0, tt = tmax, dt = dt, verbose = FALSE)
Looking at the traces
folder, we see the folder
structure is identical to the raw
folder.
# don't list parent directory
list.dirs(path = analysis_folders[2], recursive = FALSE)
#> [1] "/tmp/RtmpNE1sAY/traces/001" "/tmp/RtmpNE1sAY/traces/002"
#> [3] "/tmp/RtmpNE1sAY/traces/003"
We can look at the output in the first repetition folder.
list.files(path = file.path(analysis_folders[2], rep_out[1]))
#> [1] "E_0001.csv" "E_0002.csv" "FE_0001.csv" "FE_0002.csv" "FI_0001.csv"
#> [6] "FI_0002.csv" "FS_0001.csv" "FS_0002.csv" "H_0002.csv" "H_0003.csv"
#> [11] "L_0001.csv" "L_0002.csv" "M_0001.csv" "M_0002.csv" "P_0001.csv"
#> [16] "P_0002.csv" "U_0001.csv" "U_0002.csv"
Notice, every file begins with our one/two letter stage designation. The number following it is the node number. Remember, our network has 3 nodes, m, b, h. We expect that mosquito stages are found in nodes one and two, while humans are found in nodes 2 and 3.
We look at the reduced versions of the same two files we opened previously to compare the differences.
# read the eggs from repetition 1
egg_stage <- read.csv(file = file.path(analysis_folders[2], rep_out[1], "E_0001.csv"),
header = TRUE)
# read susceptible females from repetition 1
fs_stage <- read.csv(file = file.path(analysis_folders[2], rep_out[1], "FS_0001.csv"),
header = TRUE)
The length of each file is the same, since we need every time point, but the number of columns is significantly less.
Again, every file begins with a “Time” column, but now all files have the same remaining columns as well, just the genotypes.
# eggs
colnames(egg_stage)
#> [1] "Time" "AA" "Aa" "aa"
# females
colnames(fs_stage)
#> [1] "Time" "AA" "Aa" "aa"
Since our genotypes stay the same throughout the population, every mosquito file now has the same columns names, “Time” and then genotypes. Humans are still separated by infection stage, since that is the only way we keep track of them.
Traces make nice squiggly plots, but are not particularly helpful
otherwise. Our final step is calculating summary statistics from the
simulations. MGDrivE2 provides a function
(summarize_stats_CSV()
) to calculate the mean and quantiles
over all simulations. It reads the output of
split_aggregate_CSV()
, and returns the desired
summaries.
Quantiles are calculated empirically, using algorithm 8 from the built-in quantile function.
# mean and 95% quantiles
summarize_stats_CSV(read_dir = analysis_folders[2], write_dir = analysis_folders[3],
spn_P = SPN_P, t0 = 0, tt = tmax, dt = dt, mean = TRUE,
quantiles = c(0.025, 0.975), verbose = FALSE)
Since repetitions are summarized, we do not have the same directory
structure in the analyzed
folder, just the summaries.
list.files(path = analysis_folders[3])
#> [1] "E_Mean_0001.csv" "E_Mean_0002.csv"
#> [3] "E_Quantile_00250_0001.csv" "E_Quantile_00250_0002.csv"
#> [5] "E_Quantile_09750_0001.csv" "E_Quantile_09750_0002.csv"
#> [7] "FE_Mean_0001.csv" "FE_Mean_0002.csv"
#> [9] "FE_Quantile_00250_0001.csv" "FE_Quantile_00250_0002.csv"
#> [11] "FE_Quantile_09750_0001.csv" "FE_Quantile_09750_0002.csv"
#> [13] "FI_Mean_0001.csv" "FI_Mean_0002.csv"
#> [15] "FI_Quantile_00250_0001.csv" "FI_Quantile_00250_0002.csv"
#> [17] "FI_Quantile_09750_0001.csv" "FI_Quantile_09750_0002.csv"
#> [19] "FS_Mean_0001.csv" "FS_Mean_0002.csv"
#> [21] "FS_Quantile_00250_0001.csv" "FS_Quantile_00250_0002.csv"
#> [23] "FS_Quantile_09750_0001.csv" "FS_Quantile_09750_0002.csv"
#> [25] "H_Mean_0002.csv" "H_Mean_0003.csv"
#> [27] "H_Quantile_00250_0002.csv" "H_Quantile_00250_0003.csv"
#> [29] "H_Quantile_09750_0002.csv" "H_Quantile_09750_0003.csv"
#> [31] "L_Mean_0001.csv" "L_Mean_0002.csv"
#> [33] "L_Quantile_00250_0001.csv" "L_Quantile_00250_0002.csv"
#> [35] "L_Quantile_09750_0001.csv" "L_Quantile_09750_0002.csv"
#> [37] "M_Mean_0001.csv" "M_Mean_0002.csv"
#> [39] "M_Quantile_00250_0001.csv" "M_Quantile_00250_0002.csv"
#> [41] "M_Quantile_09750_0001.csv" "M_Quantile_09750_0002.csv"
#> [43] "P_Mean_0001.csv" "P_Mean_0002.csv"
#> [45] "P_Quantile_00250_0001.csv" "P_Quantile_00250_0002.csv"
#> [47] "P_Quantile_09750_0001.csv" "P_Quantile_09750_0002.csv"
#> [49] "U_Mean_0001.csv" "U_Mean_0002.csv"
#> [51] "U_Quantile_00250_0001.csv" "U_Quantile_00250_0002.csv"
#> [53] "U_Quantile_09750_0001.csv" "U_Quantile_09750_0002.csv"
Again, the stage designation is at the front of every file name.
Next, either Mean
or Quantile
. If
Mean
, then the node designation. If Quantile
,
which quantile, and then the node designation.
We will open the egg and susceptible females files, to see the summaries, but the structure of these files is the same as the trace files.
# read the eggs from repetition 1
egg_stage <- read.csv(file = file.path(analysis_folders[3], "E_Mean_0001.csv"),
header = TRUE)
# read susceptible females from repetition 1
fs_stage <- read.csv(file = file.path(analysis_folders[3], "FS_Mean_0001.csv"),
header = TRUE)
# eggs
dim(egg_stage)
#> [1] 26 4
# susceptible females
dim(fs_stage)
#> [1] 26 4
The dimensions are the same as the trace files. We also see that the column names are the same as well.
# eggs
colnames(egg_stage)
#> [1] "Time" "AA" "Aa" "aa"
# females
colnames(fs_stage)
#> [1] "Time" "AA" "Aa" "aa"
The plotting utilities provided by MGDrivE2 do not work on CSV output. Our assumption is that, if one is generating enough data to use the CSV utilities, then there are specific plotting requirements.