Simulation of life histories

Introduction

The life course or life history is conceptualized as a sequence of states and transitions between states. A state is a personal attribute, e.g. marital status or number of children ever born, or a combination of personal attributes. The set of possible states is the state space. It is the set of possible values of a personal attribute or combination of attributes. An individual’s life history depends on several factors, including individual characteristics, social context and personal experiences. In this paper, the influencing factors are limited to personal characteristics. The individual life history is modelled as a multistate stochastic process. The transitions the individual experiences and the states occupied after the transitions depend on personal characteristics and chance. The parameters of the multistate process are transition rates that vary with age (and covariates). They are estimated from empirical data. Parameter estimators are obtained by combining data from different individuals. As a result, the individual life history generated by the model is synthetic. Individuals are virtual individuals and the population is a virtual population.

The vignette consists of four sections. Section 2 is a brief review of the theory of multistate modelling. Section 3 covers the simulation of a single fertility history. Section 4 is an application. Fertility histories of women in the United States are simulated using data from the Human Fertility Database (HFD) (www.humanfertility.org) and the Human Mortality Database (HMD) (www.mortality.org). Individuals with identical characteristics have different life histories due to random factors ¹. $VirtualPop$ has several vignettes. To list the vignettes, use $browseVignettes("VirtualPop")$ and to see the code, use $edit(vignette(x))$ .

Theoretical background: simulating life histories

Let $_kY(x)$ denote the state individual k occupies at age x. $_kY(x)$ is a discrete random variable, a Bernoulli random variable if r=2 and a multinomial random variable if r>2. The probability that k is in state i at age x is the state probability $_kp_i(x)=Pr \left[ _kY(x) = i\right]$ , with $\sum_{i=1}^{r} {_kp_i(x)=1}$ . The probability that k is in state i at age x and in state j at age $x+\tau$ is the joint probability

$\begin{align} Pr \left[ _kY(x) = i,\text{ } _kY(x+\tau) = j \right] \tag{1} \end{align}$

The probability is the product of the probability that k is in i at x and the conditional probability that k is in j at $x+\tau$ , provided k is in i at x:

$\begin{align} Pr \left[ _kY (x)=i,_kY(x+\tau)=j \right] = Pr \left [ _kY(x+\tau)=j |_kY(x)=i \right] Pr \left[_kY(x)=i \right] \tag{2} \end{align}$

The conditional probability is the transition probability and is denoted by $_kp_{j|i}(x,x+\tau)$ , which is more conveniently written as $_kp_{ij}(x,x+\tau)$ and, if $\tau=1$ , it is also written as $_kp_{ij}(x)$ . The probability that k is in j at $x+\tau$ is

$\begin{align} _kp_j(x+\tau)=\sum_{i=1}^{r} {_kp_{ij}(x,x+\tau)} {_kp_i(x)} \end{align}$

which is known as state equation. It expresses the state probability at $x+\tau$ in terms of the state probabilities at x and the transition probabilities.

State probabilities may be combined in a state vector $_k\mathbf{S}(x)$ and transition probabilities in the transition matrix $_k{\mathbf{P}}(x,x+\tau)$ :

$\begin{align} _k{\mathbf{P}}(x,x+\tau)= \begin{bmatrix} \text{ }_k p_{11}(x) & _k p_{12}(x) & \cdots & _k p_{1r}(x) \\ _k p_{21}(x) & \text{ }_k p_{22}(x) & \cdots & _k p_{2r}(x) \\ \vdots & \vdots & \ddots & \vdots \\ _k p_{r1}(x) & _k p_{r2}(x) & \cdots & \text{ }_k p_{rr}(x) \end{bmatrix} \tag{3} \end{align}$ where $p_{ii}(\tau)=1 - \sum_{j \ne i}{} _k p_{ij} (\tau)$ .

The element $_k p_{ij} (x,x+\tau)$ denotes the probability that k who resides in location i at age x resides in j at age $x+\tau$ . The diagonal element $_kp_{ii} (x,x+\tau)$ is the probability that k, who is in i at x, is also in i at $x+\tau$ . It does not imply that k stays in i from x to $x+\tau$ on a continuous basis. Multiple transitions are allowed during the interval $(x,x+\tau)$ .

The state vector at $x+\tau$ is

$\begin{align} _k\mathbf{S}(x+\tau)=\left[ _k \mathbf{P}(x,x+\tau \right]' \text{ }_k\mathbf{S}(x) \tag{4} \end{align}$

where $_k\mathbf{S}(t)$ is a column vector and ’ denotes transpose. This matrix expression is a fundamental equation in multistate modelling. In this vignette, the state equation is specified at the individual level. The transition probabilities of an individual of age x depend on personal attributes at x, the history of attributes, and contextual factors. In this vignette, the dependencies are restricted.

If $\tau$ is small and tends to zero, $\begin{align} \frac{1}{\tau} \displaystyle \lim_{\tau \to 0} Pr \left[_kY(x+\tau)=j | _kY(x)=i \right]= \text{ }_k \mu_{ij}(x) \end{align}$ is the instantaneous rate of transition (transition intensity) at age x. Multistate models are fully specified by their transition intensities. If the instantaneous transition rates are independent of the history of k, the stochastic process governed by the transition intensities is a continuous-time Markov process. A common assumption in demography is that $_k \mu_{ij}(x)$ is piecewise constant (e.g. age-specific rates).

The instantaneous transition rates are combined in a transition rate matrix: $\begin{align} _k\mathbf{\mu}(x)= \begin{bmatrix} \text{ }_k\mu_{11}(x) & -_k\mu_{12}(x) & \cdots & -_k\mu_{1r}(x) \\ -_k\mu_{21}(x) & \text{ }_k\mu_{22}(x) & \cdots & -_k\mu_{2r}(x) \\ \vdots & \vdots & \ddots & \vdots \\ -_k\mu_{r1}(x) & -_k\mu_{r2}(x) & \cdots & \text{ }_k\mu_{rr}(x) \end{bmatrix} \tag{5} \end{align}$ where $\mu_{ii}(\tau)=\sum_{j=i}{} _k \mu_{ij} (\tau)$ .

The cumulative transition rate during the age interval from 0 to x is the cumulative hazard: $\begin{align} _k\Lambda_{ij}(x)=\int_{0}^{x} {_k\mu_{ij}(x)} \tag{6} \end{align}$

The formulation assumes that, throughout the interval from 0 to x, individual k may enter any state j from any state i unless $_k \mu_{ij}(\tau)$ is zero. A state that can be entered and left is a transient state, while a state that can be entered but not left is an absorbing state. To be able to move from i to j at time $\tau$ , i.e. during the small interval from $\tau$ to $\tau+d\tau$ , individual k must be in i at age $\tau$ . The current state occupied determines possible destination states. The state k occupies at $\tau$ is determined endogenously, i.e. by the model. Individual k is in i at age $\tau$ and at risk of a transition to j if $_kY(\tau)=i$ . Let $_kY_i(\tau)$ be an indicator variable which is one if k is in i at $\tau$ and zero otherwise: $_kY_i(\tau)=I((_kY(\tau)=i)$ . The quantity $_kY_i(\tau) d\tau$ is the time k spends in i during the interval $(\tau,\tau+d\tau)$ . It is also the duration of exposure to the risk (duration at risk) of leaving i. The total duration of exposure before age x is $\int_{0}^{x} {_kY_i(\tau) d\tau}$ . It is the duration of stay on i between 0 and x. The stay does not need to be on a continuous basis. Individual k may leave i and return to i between ages 0 and x. The transition intensity at age x is the product of the instantaneous transition rate and the indicator variable:

$\begin{align} _k\lambda_{ij}(\tau) = _k\mu_{ij}(\tau) _kY_i(\tau) \tag{7} \end{align}$ The function $_k \mu_{ij}(\tau)$ is the hazard function, $_kY_i(\tau)$ the exposure function and $_k\lambda_{ij}(\tau)$ the intensity function (Aalen, Borgan, and Gjessing 2008, 31; Cook and Lawless 2018, 29). The expression $_k\mu_{ij}(\tau) d \tau$ is the probability that a transition occurs during the interval $(\tau,\tau+d\tau)$ , provided k is at risk of experiencing the transition; $_kY_i(\tau) d\tau$ is the duration of exposure during the interval $(\tau,\tau+d\tau)$ ; and $_k\lambda_{ij}(\tau) d\tau$ is the expected number of transitions from i to j by individual k during the interval $(\tau,\tau+d\tau)$ . The integral $\int_{0}^{x} {_k\lambda_{ij}(\tau) d\tau}$ is the expected number of (i,j)-transitions between ages 0 and x. It is denoted by $_kN_{ij}(x)$ . The sequence of random variables $\{ {_kN_{ij}(\tau); 0 \le \tau \le x}\}$ is a counting process (Aalen, Borgan, and Gjessing 2008, 25). The presentation of life history processes as counting processes is a logical approach in demography because event occurrences are related to exposures. Willekens (2014) adopts the counting process perspective in multistate demographic modelling. Notice that the cumulative hazard $_k\Lambda_{ij}(x)$ may also be interpreted as an expected number of transitions between ages 0 and x. It is the expected number, provided individual k remains at risk of an (i,j)-transition during the entire period from 0 to x. If during the interval from 0 to x, individual k is at risk during the subinterval from $x_1$ to $x_2$ with $0 \le x_1,x_2 \le x$ , then the cumulative hazard may be written as

$\begin{align} _k\Lambda_{ij}(x_1,x_2)=\int_{x_1}^{x_2} {_k \mu_{ij}(\tau) d\tau} =\int_{x_1}^{x_2} {_k \mu_{ij}(\tau)_kY_i(\tau) d\tau} =\int_{x_1}^{x_2} {_k \lambda_{ij}(\tau) d\tau } \tag{8} \end{align}$

During the interval, k may stop being at risk by moving to another destination or by censoring the observation. In the simulation, the cumulative hazard needs to be computed for a period during which individual k is exposed to the risk of a (i,j)-transition. Often, a period of exposure is endogenous and starts with the occurrence of another transition. For instance, the birth of a first child triggers a period at risk of a birth of a second child. The probability that k, who is in i at $x_1$ , is in j at age $x_2$ is $exp \left[- _k \Lambda_{ij}(x_1,x_2) \right]$ . The probability that k, who is in i at $x_1$ , is still in i at x $(x>x_1)$ is $\begin{align} _k S_{i+}(x_1,x) =exp \left[-\int_{x_1}^{x} {\sum_{j \ne i} {_k \mu_{ij}(\tau) d\tau} } \right] = exp\left[-\int_{x_1}^{x_2} {_k \mu_{i+}(\tau) d\tau} \right]=exp\left[_k\Lambda_{i+}(x_1,x) \right] \end{align}$ To determine the waiting time to leaving i for individual k, who is in i at x_1, a random number u is drawn from the standard uniform distribution and the root of the equation

$\begin{align} g(x-x_1)=_k \Lambda_{i+}(x_1,\infty)+ln(1-u)=0 \tag{9} \end{align}$ is determined. It gives the duration between $x_1$ and the age at transition, i.e. $x_u-x_1$ , with $x_u$ the age for which the above equation holds. The destination j is determined by sampling the multinomial distribution with parameters $\{ _kp_{i1}(x),_kp_{i2}(x) ,….,_kp_{ir}(x) \}$ $\begin{align} _k p_{ij}^*(x)=\frac{_k\mu_{ij}(x)} {\sum_{j \ne i} {_k\mu_{ij}(x)} } \hspace{1.5cm} {j \ne i} \tag{10} \end{align}$

and $\sum_{j \ne i} {_k}p_{ij}^*(x)=1$ . Sampling a multinomial distribution consists of two steps. First, a random number u is drawn from the standard uniform distribution. Second, the position of u in the parameter space is determined. If $u < {_kp_{i1}^*(x)}$ , the destination is state 1. If ${_kp_{i1}^*(x)} \le u < {_kp_{i2}^*(x)}$ , the destination is state 2, etc. The method is equivalent to creating a vector with endpoints 0 and 1 and breakpoints ${_kp_{i1}^*(x)},{_kp_{i1}^*(x)}+{_kp_{i2}^*(x)},{_kp_{i1}^*(x)}+{_kp_{i2}^*(x)}+{_kp_{i3}^*(x)}$ , etc. and to determine the segment that contains u.

Some transitions may not be possible, leading to an adjustment of the matrix of instantaneous transition rates $_k\mathbf{\mu}(x)$ . In the next section, the above theory is applied to simulate fertility careers. Each female member of the population becomes at risk of conception at menarche and remains at risk until menopause (conditional on survival and in the absence of sterilization). Fertility rates are zero initially, turn positive during adolescence and become zero again at menopause. The age at birth of the first child is determined by drawing a random waiting time from a piecewise exponential distribution with the age-specific first birth rates as parameters. If the waiting time exceeds the maximum reproductive age, the woman remains childless. Assume singleton births (no twins or triplets). Women with a first child are exposed to the risk of a second child. Exposure starts at the age at birth of the first child and ends at the age at birth of the second child or the maximum reproductive age. Women with two children are exposed to the risk of a third child, and so on. Formally, women with i children are at risk of an additional child. The risk period starts at the age of birth of the i-th child and ends at birth of the i+1-st child or the end of the reproductive period (age at menopause). If i=0, exposure starts at the lowest age of the reproductive period (age at menarche).

Assuming no multiple births, a feasible transition is from i births to i+1 births. Other transitions are not possible. Hence, the transition intensities $_k\mu_{ij}(t)$ are 0 except if j=i+1, with i the current number of biological children ever born to k. The matrix of party-specific instantaneous fertility rates is: $\begin{align} _k\mathbf{\mu}(x)= \begin{bmatrix} \text{ }_k\mu_{11}(x) & -_k\mu_{12}(x) & 0 & \cdots & 0 \\ 0 & \text{ }_k\mu_{22}(x) & -_k\mu_{23}(x) & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \text{ }_k\mu_{rr}(x) \end{bmatrix} \tag{11} \end{align}$ with r the highest parity considered in the analysis and $_k\mu_{rr}(x)$ the instantenous rate that a woman with r children has another child. For instance, if r is five, then a woman with five children may have another child. Beyond parity five, the fertility rate is independent of the parity. The matrix of transition intensities defines a hierarchical model. A hierarchical model is characterized by having r living states connected by r-1 rates of transition (Schoen 2016, 2019).

In the presence of mortality, the reproductive life course may be interrupted by death. Birth of the next child and death are competing risks. In the model, it is assumed that the rate of death is independent of the number of children ever born. In that case, individual lifespans may be simulated first, followed by the simulation of fertility careers.

Illustration: simulation of a single life history

Consider a most simple case. An individual has one attribute with three possible values (A, B and C). The three possible values is the state space. The individual moves between the three states at constant rates. The transition rates are:

A <- matrix(c(   0.0,0.10,0.05,        
                 0.07,0.0,0.03,
                 0.02,0.05,0.0),nrow=3,byrow=TRUE)
namstates <- c("A","B","C")
dimnames(A) <- list(origin=namstates,destination=namstates)
diag(A) <- -rowSums(A)
B <- -A
B
#>       destination
#> origin     A     B     C
#>      A  0.15 -0.10 -0.05
#>      B -0.07  0.10 -0.03
#>      C -0.02 -0.05  0.07

The transition rate matrix B has the format of the matrix in equation 5.

The constant transition rates are used to generate life histories during segments of life. The function $sim.msm()$ of the $msm$ package simulates one realisation from a continuous-time Markov process up to a given time. Note that $sim.msm()$ requires that the states are numeric values. It also requires that the transition matrix has origin in rows and destination in columns (Jackson 2011, 6). The following code chunk generates a history between ages 20 and 40 for an individual who starts in state A at age 20.

bio <- msm::sim.msm (qmatrix=-B,mintime=20,maxtime=40,start=1)
bio
#> $states
#> [1] 1 2 2
#> 
#> $times
#> [1] 20.00000 23.66791 40.00000
#> 
#> $qmatrix
#>       destination
#> origin     A     B     C
#>      A -0.15  0.10  0.05
#>      B  0.07 -0.10  0.03
#>      C  0.02  0.05 -0.07

The function produces a list of three objects:

Sequence of states occupied. The first state is the state at the start of the simulation. The last state is the state occupied at the end of the simulation, which is the same as the state occupied after the last transition.
Ages of transition. The list includes the starting age and ending age.
The transition matrix used to generate the life histories

At the start of the simulation, the individual is in state A (state 1) and exposed to the risk of moving to state B (state 2) or C (state 3). A random draw from the exponential waiting time distribution gives 3.67 years (since the start of the simulation and 23.67 years since the start of the time scale). Since 3.67 is the duration at which the value of the cumulative probability distribution is equal to u, we may determine the value of u produced by the random draw:

The probability that an individual leaves state A before the end of the observation at 40 is $1-exp\left [-0.15*20\right]=0.95$ . Hence, if a random number is drawn that exceeds 0.95, the waiting time to the transition exceeds the observation window. In that case, the transition does not occur due to censoring.

The destination state is determined by a Bernoulli trial with two possible outcomes: B or C and probability $0.10/(0.10+0.05)=0.667$ that B is the outcome.

The function $sim.msm()$ is also used by Cook and Lawless (2018, 339).

After this illustration of the mechanism, we return to fertility. Suppose a female member of the virtual population experiences throughout her life the age- and parity-specific fertility rates of the United States in 2019. The fertility career is a realization (sample path) of a continuous-time Markov process. The function $Sim\_bio()$ of $VirtualPop$ is used to generate the fertility career. Unlike in $sim.msm()$ , the transition rates used by $Sim\_bio()$ are age-specific. The fertility rates of the United States in 2019 are distributed with the $VirtualPop$ package. They are loaded with the $data$ function of R base:

library (VirtualPop)
rates <- NULL
data(rates)

The object $rates$ has three components: (a) the death rates by age and sex (ASDR), (b) the fertility rates by age and parity (ASFR), and (c) the fertility rates in a format required by multistate models (ratesM). The rates for ages 25 to 29 are:

rates$ratesM[26:29,,]
#> , , Origin = par0
#> 
#>     Destination
#> Age     par0     par1 par2 par3 par4 par5 par6
#>   25 0.05250 -0.05250    0    0    0    0    0
#>   26 0.05646 -0.05646    0    0    0    0    0
#>   27 0.06218 -0.06218    0    0    0    0    0
#>   28 0.06976 -0.06976    0    0    0    0    0
#> 
#> , , Origin = par1
#> 
#>     Destination
#> Age  par0    par1     par2 par3 par4 par5 par6
#>   25    0 0.16041 -0.16041    0    0    0    0
#>   26    0 0.15638 -0.15638    0    0    0    0
#>   27    0 0.15205 -0.15205    0    0    0    0
#>   28    0 0.15047 -0.15047    0    0    0    0
#> 
#> , , Origin = par2
#> 
#>     Destination
#> Age  par0 par1    par2     par3 par4 par5 par6
#>   25    0    0 0.14116 -0.14116    0    0    0
#>   26    0    0 0.13336 -0.13336    0    0    0
#>   27    0    0 0.12156 -0.12156    0    0    0
#>   28    0    0 0.11499 -0.11499    0    0    0
#> 
#> , , Origin = par3
#> 
#>     Destination
#> Age  par0 par1 par2    par3     par4 par5 par6
#>   25    0    0    0 0.13117 -0.13117    0    0
#>   26    0    0    0 0.12155 -0.12155    0    0
#>   27    0    0    0 0.11328 -0.11328    0    0
#>   28    0    0    0 0.10449 -0.10449    0    0
#> 
#> , , Origin = par4
#> 
#>     Destination
#> Age  par0 par1 par2 par3    par4     par5 par6
#>   25    0    0    0    0 0.14558 -0.14558    0
#>   26    0    0    0    0 0.13723 -0.13723    0
#>   27    0    0    0    0 0.12618 -0.12618    0
#>   28    0    0    0    0 0.11658 -0.11658    0
#> 
#> , , Origin = par5
#> 
#>     Destination
#> Age  par0 par1 par2 par3 par4    par5     par6
#>   25    0    0    0    0    0 0.14558 -0.14558
#>   26    0    0    0    0    0 0.13723 -0.13723
#>   27    0    0    0    0    0 0.12618 -0.12618
#>   28    0    0    0    0    0 0.11658 -0.11658
#> 
#> , , Origin = par6
#> 
#>     Destination
#> Age  par0 par1 par2 par3 par4 par5 par6
#>   25    0    0    0    0    0    0    0
#>   26    0    0    0    0    0    0    0
#>   27    0    0    0    0    0    0    0
#>   28    0    0    0    0    0    0    0

The following function call simulates the fertility career of a hypothetical person born on June 12th 1990.

popsim <- data.frame (ID=3,
           born=1990.445,
           start=0,
           end=55,
           st_start="par0")
ch <- suppressWarnings(Sim_bio (datsim=popsim,ratesM=rates$ratesM))
ch
#> $age_startSim
#> [1] 0
#> 
#> $age_endSim
#> [1] 56
#> 
#> $nstates
#> [1] 1
#> 
#> $states
#> [1] 1
#> 
#> $path
#> [1] "par0"
#> 
#> $ages_trans
#> [1] 0 0

The individual has a first child at age 28 and a second at age 34. Since the date of birth is known, the calendar dates at childbirth can be determined. The first child is born on July 3rd 2018 and the second on 27th March 2025.

format(lubridate::date_decimal(1990.445+28.0567), "%Y-%m-%d")
#> [1] "2018-07-03"
format(lubridate::date_decimal(1990.445+34.789), "%Y-%m-%d")
#> [1] "2025-03-27"

The simulation window covers the entire reproductive period. The simulation may be interrupted at any age during the reproductive period, analogous to the censoring of a longitudinal observation. To assess the validity of the simulation, the simulation window must coincide with the observation window (see vignette Validity of the simulation). The argument $end$ of the $Sim\_bio()$ function is an individual-level variable indicating the end of the simulation for the individual.

$Sim\_bio()$ is used in the function $Children()$ of $VirtualPop$ . The function $Children()$ produces two objects (data frames), one related to the mother (and father) and the other related to the new-born children. Using $Children()$ to simulate the fertility career of the individual in the virtual population with ID equal to 1, is

Children (dat0=dataLH[c(1,3),1:11],rates)
#> $data
#>   ID gen    sex   bdated   ddated x_D IDpartner IDmother IDfather jch nch id.1
#> 1  1   1 Female 2019.053 2112.222  35      1177       NA       NA  NA   0   NA
#> 3  3   1 Female 2019.158 2091.099  35      9416       NA       NA  NA   0   NA
#>   id.2 id.3 id.4 id.5 id.6 id.7 id.8 id.9 age.1 age.2 age.3 age.4 age.5 age.6
#> 1   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA
#> 3   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA
#>   age.7 age.8 age.9
#> 1    NA    NA    NA
#> 3    NA    NA    NA
#> 
#> $dch
#> [1] NA

Now we turn to an application: the simulation of fertility careers of women in the United States.

Application: fertility histories of women in the United States

The model described in the previous section is used to simulate individual fertility histories in the presence of mortality.The data are fertility and mortality rates of the United States in 2019. The data are downloaded from the Human Mortality Database (HMD) and the Human Fertility Database (HFD). The retrieval of data from the HMD and HFD is described in another vignette ( $Tutorial$ ). The mortality and fertility rates are assumed to remain constant during the simulation. Consequently, the simulated life histories convey information embedded in the mortality and fertility rates of 2019. They should not be interpreted as descriptions of historical trends, projections or forecasts. That requires time-varying rates.

The age-specific fertility and mortality rates of the population of the United States in 2019 are distributed with the $VirtualPop$ package for illustrative purpose. The rates are contained in the data object $rates$ . The rates can be used after being loaded with the $data$ function of R base:

library (VirtualPop)
rates <- NULL
data(rates)

The creation of the virtual population from fertility and mortality rates involves several steps. The remainder of this section is a description of the steps to generate the virtual population. The steps are:

Determine the size of the virtual population
Create a data frame with selected data on each individual in the virtual population.
- Add ID of partner (function $Partnership()$ )
Generate individual fertility histories (for members of initial population - generation 1) (function $Children()$ )
Add, for each new-born, date of birth, sex, ID of mother and ID father
Add, for each new-born, ID of partner
Generate individual fertility histories for members of generation 2 (function $Children()$ )
Generate individual fertility histories for members of generation 3 (function $Children()$ )
Generate individual fertility histories for members of generation 4 (function $Children()$ )

The simulation starts by determining the size of the virtual population. Each individual in the virtual population is assigned a unique identification number (ID) and a date of birth. If all births occur in the same year or same period, then the population consists of a single birth cohort. A real population may be approximated by distributing births over many years. In this section, it is assumed that all births occur in a single year. Dates of birth are obtained by sampling from a probability distribution of births during the year. In this paper, a uniform distribution is assumed (no seasonal effects). The date of birth of an individual is the year of birth plus a fraction of a year, equal to a random number from the standard uniform distribution. The date of birth is expressed as a real number (decimal numeral system) and is referred to as decimal date of birth. The format is particularly useful to compute ages and durations. Decimal dates may be converted into calendar dates. Note that the computer does not store calendar dates, but stores dates as time (seconds or milliseconds) elapsed since a reference date and time, e.g. midnight, 1st January 1970. The conversion into calendar dates is not at all straightforward and software packages may give different results. The conversion should account for months of different length and leap years. The R library contains several date functions that correctly handle chronological objects. For instance, the $date\_decimal()$ function of the $lubridate$ package, which is part of $tidyverse$ , converts a decimal date into a calendar date. For a review and discussion of issues related to ages and durations in the simulation of life histories, see Willekens (2013).

At the start of the simulation, each individual is randomly assigned a sex using the empirical age-specific sex ratio at birth in the base year. Other personal attributes may be added, but they are omitted in this illustration.

The simulation includes a very simple partner formation process. Partners are selected at random from the opposite sex. Since the process is purely random, partnership formation may be simulated to occur at birth. When more boys are born than girls, some males remain without a partner.

For each individual in the virtual population, a data record is created. It contains the individual’s ID, date of birth and sex. It also contains a variable for the generation to which the individual belongs. Members of the initial population constitute generation one. The code is

refyear <- 2019
ages <- c(0:110)
ncohort <- 1000
ID <- 1:ncohort
sex <- rbinom(ncohort,1,prob=1/2.05)
sex <- factor (sex,levels=c(0,1),labels=c("Male","Female"),ordered=TRUE)
# Population size by sex
nmales <- length(sex[sex=="Male"])
nfemales <- length(sex[sex=="Female"])
gen <- rep(1,ncohort) # generation 1
# Decimal date of birth
bdated <- refyear+runif(ncohort)
# Create data frame
d <- data.frame (ID=ID,gen=gen,sex=sex,bdated=bdated,ddated=NA,x_D=NA,IDpartner=NA,IDmother=NA,IDfather=NA,jch=NA,nch=NA)
# Ages at death, obtained by sampling a peicewise-exponential distribution, using the rpexp function of the msm package
d$x_D[d$sex=="Male"] <- msm::rpexp(n=nmales,rate=rates$ASDR[,"Males"],t=ages)
d$x_D[d$sex=="Female"] <- msm::rpexp(n=nfemales,rate=rates$ASDR[,"Females"],t=ages)
# Decimal data of death
d$ddated <- d$bdated+d$x_D

Individual records include empty spaces for the IDs of the partner, the mother and father of the individual. Their IDs are determined later. The spaces are included to facilitate the merging of data on multiple generations into a single data frame.

The simple partnership model is implemented in function $Partnership(d)$ . It accepts the data frame $d$ as input and returns the data frame augmented by the partner’s ID.

d <- VirtualPop::Partnership(dLH=d)

The first lines of the data frame of individual records are:

head(d)

ABCDEFGHIJ0123456789

	ID <int>	gen <dbl>	sex <ord>	bdated <dbl>	ddated <dbl>	x_D <dbl>	IDpartner <int>	IDmother <lgl>	IDfather <lgl>
1	1	1	Female	2019.867	2101.203	81.33588	710	NA	NA
2	2	1	Male	2019.975	2104.618	84.64291	32	NA	NA
3	3	1	Male	2019.365	2106.344	86.97887	658	NA	NA
4	4	1	Male	2019.103	2101.044	81.94038	NA	NA	NA
5	5	1	Female	2019.630	2103.678	84.04821	526	NA	NA
6	6	1	Male	2019.485	2075.911	56.42551	61	NA	NA

For women, ages at childbirth are determined by sampling from waiting time distributions with parameters equal to the fertility rates by age and parity. When a child is born, it receives a unique ID and a new individual data record is created. The generation variable is set to 2. The date of birth is determined by the date of birth of the mother and her exact age at birth of the child. The ID of the mother is added to the child’s record and the child’s ID is added to the data records of the mother and her partner, assumed to be the father. The record linkages enable to track the descending line of offspring and ascending line of parentage. It significantly facilitates the study of genealogies and kinship networks. The new-born is also assigned a sex using the empirical sex ratio at birth and is randomly allocated a partner. The lifespan and the fertility career of each member of the second generation is determined by sampling the appropriate probability distributions. The function $Children(dat0,rates)$ does the computations. The first argument is the individual data structure, $d$ in this illustration, and the second is the object with the transition rates. For instance,

dch1 <- VirtualPop::Children(dat0=d,rates=rates)

The object $dch1$ has two components: $data$ and $dch$ . The first components is comparable to the women’s file in demographic surveys and the second to the children’s file (see e.g. Demographic and Health Survey (DHS) [https://dhsprogram.com]). The object $data$ is the input object $d$ with, for each mother and father, information on

number of children, and
the ID of each child and the mother’s age at birth.

The second object $dch$ is a data frame with individual records for the new-borns (second generation). It has the ID of the child, the generation, sex, date of birth, the birth order (variable $jch$ ), the ID of the mother, the ID of the father, and the date of death of the child. The ID of the father is the ID of the partner of the mother:

dch1$dch$IDfather <- dch1$data$IDpartner[dch1$dch$IDmother]

The first records are shown below.

head(dch1$data)

ABCDEFGHIJ0123456789

	ID <int>	gen <dbl>	sex <ord>	bdated <dbl>	ddated <dbl>	x_D <dbl>	IDpartner <int>	IDmother <lgl>	IDfather <lgl>
1	1	1	Female	2019.867	2101.203	81.33588	710	NA	NA
2	2	1	Male	2019.975	2104.618	84.64291	32	NA	NA
3	3	1	Male	2019.365	2106.344	86.97887	658	NA	NA
4	4	1	Male	2019.103	2101.044	81.94038	NA	NA	NA
5	5	1	Female	2019.630	2103.678	84.04821	526	NA	NA
6	6	1	Male	2019.485	2075.911	56.42551	61	NA	NA

head(dch1$dch)

ABCDEFGHIJ0123456789

	ID <dbl>	gen <dbl>	sex <fct>	bdated <dbl>	ddated <dbl>	x_D <dbl>	IDpartner <lgl>	IDmother <int>	IDfather <int>
1	1001	2	Male	2046.088	2135.726	89.63774	NA	1	710
2	1002	2	Male	2055.839	2142.785	86.94599	NA	1	710
3	1003	2	Male	2041.218	2127.944	86.72528	NA	5	526
4	1004	2	Female	2050.219	2117.535	67.31555	NA	5	526
5	1005	2	Female	2053.708	2141.797	88.08915	NA	5	526
6	1006	2	Male	2054.338	2135.558	81.22066	NA	5	526

The members of the second generation partner with individuals of the same generation. Partners are allocated randomly.

d2 <- VirtualPop::Partnership (dLH=dch1$dch)

The object $d2$ is $dch1\$dch$ augmented by the IDs of partners. The data frame contains the data on the second generation in a format that is consistent with the data frame of the first generation, which facilitates merging the data frames. The $Children()$ function is used to obtain information on the third generation (children of women of the second generation):

dch2 <-  VirtualPop::Children(dat0=d2,rates=rates)

Partnership in the third generation is the outcome of a random search:

d3 <- VirtualPop::Partnership (dLH=dch2$dch)

A similar procedure is followed to generate the fourth generation (children of mothers of the third generation) and higher-order generations.

dch3 <-  VirtualPop::Children(dat0=d3,rates=rates)
d4 <- VirtualPop::Partnership (dLH=dch3$dch)
dch4 <-  VirtualPop::Children(dat0=d4,rates=rates)
d4 <- dch4$data[,1:which (colnames(dch4$data)=="nch")]

The object $d4$ has data on the fourth generation, including on the children of members of the fourth generation.

Data on the four generations is merged to a single data frame.

data4 <- rbind(dch1$data,dch2$data,dch3$data,dch4$data)

The data frame that results is similar to the $dataLH$ data object included in the $VirtualPop$ package.

References

Aalen, Odd, Ornulf Borgan, and Hakon Gjessing. 2008. Survival and Event History Analysis: A Process Point of View. Springer Science & Business Media.

Caswell, Hal. 2009. “Stage, Age and Individual Stochasticity in Demography.” Oikos 118 (12): 1763–82.

Cook, Richard J, and Jerald F Lawless. 2018. Multistate Models for the Analysis of Life History Data. Chapman; Hall/CRC.

Jackson, Christopher. 2011. “Multi-State Models for Panel Data: The Msm Package for R.” Journal of Statistical Software 38: 1–28. https://doi.org/10.18637/jss.v038.i08.

Schoen, Robert. 2016. “Hierarchical Multistate Models from Population Data: An Application to Parity Statuses.” PeerJ 4: e2535.

———. 2019. Analytical Family Demography. Vol. 47. Springer.

Willekens, Frans. 2013. “Chronological Objects in Demographic Research.” Demographic Research 28: 649–80.

———. 2014. Multistate Analysis of Life Histories with R. Springer.

For a discussion of the distinction between individual heterogeneity and individual stochasticity, see Caswell (2009).↩︎

Simulation of life histories

Frans Willekens

2022-06-23

Introduction

Theoretical background: simulating life histories

Illustration: simulation of a single life history

Application: fertility histories of women in the United States

References