Estimating age-specific mortality rates at the subnational level

Introduction

This is a tutorial on estimating age-specific mortality rates at the subnational level, using a model similar to that described in our Demography paper. There are four main steps, which will be described below:

  1. Prepare data and get it in the right format
  2. Choose and create a mortality standard
  3. Fit the model
  4. Analyze results from the model

A few notes on this particular example:

  • I’ll be fitting the model to county-level mortality rates in California over the years 1999 to 2016. These are age-specific mortality rates for both sexes for the age groups <1, 1-4, 5-9, 10-14, 15-19, 20-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, 85+.
  • Data on deaths and populations are publicly available through CDC WONDER. However, death counts of less than 10 are suppressed, so there are missing data for some age group/year/county combinations. Also note that there are no observations in any year for two counties, Sierra and Alpine.
  • All analysis was done in R and the model was fit using JAGS. Other MCMC options such as Stan, WinBUGS or PyMC would probably work just as well.

All the code to reproduce this example can be found here: https://github.com/MJAlexander/states-mortality/tree/master/CA_county_example. Please see the R file CA.R in the code folder.

A note on modeling: there are many adaptations that can be made to this broad model set-up, which may be more suitable in different situations. When estimating mortality in your own work, make sure to carry out a suitable validation process to check that the estimates are sensible, and fully test alternatives.

1. Preparing the data

The first step is to obtain data on death counts and population by age (and potentially sex) groups, and get it into the right format for modeling. Note that you need counts, not just mortality rates, as inputs to the model.

In this example, I downloaded data on death and population counts by county (the files CA.csv and CA_pop.csv in the data folder). Because these two data sources had different age groups available, I had to do a bit of cleaning up to make sure everything was consistent. The resulting deaths data has the following form:

##               county code age_group year deaths   pop age          mx
## 1 Alameda County, CA 6001  < 1 year 1999    110 19336   0 0.005688871
## 2 Alameda County, CA 6001  < 1 year 2000    104 19397   0 0.005361654
## 3 Alameda County, CA 6001  < 1 year 2001    133 22044   0 0.006033388
## 4 Alameda County, CA 6001  < 1 year 2002     91 21316   0 0.004269094
## 5 Alameda County, CA 6001  < 1 year 2003     97 21091   0 0.004599118
## 6 Alameda County, CA 6001  < 1 year 2004    111 20339   0 0.005457495
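As a rough illustration of the final merge and rate calculation (the actual cleaning code is in CA.R; deaths_df and pop_df are hypothetical names for cleaned versions of the two files), data of the form above could be built along these lines:

library(dplyr)

# deaths_df and pop_df: hypothetical cleaned versions of CA.csv and CA_pop.csv,
# with one row per county/age_group/year
d <- deaths_df %>% 
  left_join(pop_df, by = c("county", "code", "age_group", "year")) %>% 
  mutate(mx = deaths / pop)  # age-specific mortality rate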

For the JAGS model, the data have to be in the form of an array. The notation used throughout the JAGS model refers to age \(x\), time \(t\), area \(a\) and state \(s\). So both the deaths and population data need to be in the form of an array with dimensions age x time x area x state. I did this in quite an ugly way combining loops and tidyverse, which probably isn’t the most elegant way, but it works :) The resulting deaths data for the first county (Alameda) looks like this:

y.xtas[,,1,1]
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
##  [1,]  110  104  133   91   97  111  105   91  105    89    77    94    79
##  [2,]   17   22   15   13   13   16   10   21   18    15    18    12    12
##  [3,]   14   14   16   15   10   13   NA   15   11    NA    NA    NA    NA
##  [4,]   16   19   17   13   19   13   22   13   NA    15    13    11    11
##  [5,]   48   36   41   57   51   50   51   66   54    47    43    51    34
##  [6,]   71   71   83   76   89   75   72   94   99    89    89    73    69
##  [7,]  194  193  179  195  187  201  171  194  167   169   156   120   173
##  [8,]  426  435  401  396  419  345  343  337  329   310   292   253   262
##  [9,]  779  799  783  836  827  825  812  758  768   735   708   629   677
## [10,] 1078 1069 1009 1066 1130 1064 1084 1106 1211  1176  1141  1159  1319
## [11,] 1779 1645 1531 1517 1445 1452 1362 1385 1368  1341  1333  1348  1353
## [12,] 2748 2792 2841 2714 2719 2517 2512 2395 2316  2271  2103  2088  2137
## [13,] 2629 2687 2720 2617 2748 2586 2782 2831 2874  2974  2922  3064  3041
##       [,14] [,15] [,16] [,17] [,18]
##  [1,]    75    86    76    75    71
##  [2,]    15    10    11    NA    10
##  [3,]    NA    NA    NA    NA    NA
##  [4,]    NA    12    NA    NA    NA
##  [5,]    49    43    38    46    37
##  [6,]    84    74    87    82    77
##  [7,]   163   159   180   178   214
##  [8,]   263   260   261   279   273
##  [9,]   675   647   549   598   603
## [10,]  1283  1273  1301  1291  1274
## [11,]  1528  1599  1624  1698  1750
## [12,]  2019  2040  2015  2059  2137
## [13,]  3253  3376  3276  3476  3462
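For reference, here is a minimal sketch of one way to fill such an array, assuming the merged data frame d from above has a row for each of the 13 age groups in every county-year (with suppressed death counts stored as NA):

# dimensions: X age groups, T years, A areas (counties), S states
X <- length(age_groups); T <- length(years)
A <- length(counties);   S <- 1

y.xtas   <- array(NA, dim = c(X, T, A, S))
pop.xtas <- array(NA, dim = c(X, T, A, S))

for (a in 1:A) {
  for (t in 1:T) {
    d_at <- d[d$county == counties[a] & d$year == years[t], ]
    d_at <- d_at[match(age_groups, d_at$age_group), ]  # put ages in canonical order
    y.xtas[, t, a, 1]   <- d_at$deaths
    pop.xtas[, t, a, 1] <- d_at$pop
  }
}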

2. Preparing the mortality standard

The other main inputs to the mortality model are the principal components derived from the mortality standard. Which mortality standard you choose to derive your principal components from depends on your specific problem. In the case of this example, I decided to use state-level mortality schedules for all states in the US over the period 1959–2015. These data are available through the United States Mortality Database.

The code I used to create the principal components from these data is here. Again, note that for this particular example I had to alter the data so that the age groups were consistent.

Once the principal components are obtained, they are input into the model as a matrix with dimensions age x component. Note that the model fitted here uses three components. The inputs are below:

##            V1           V2           V3
## 1  0.20853194  0.630214628  0.287652535
## 2  0.35264710  0.300256735 -0.162737410
## 3  0.38660035  0.252316283 -0.287091193
## 4  0.38118755 -0.017848623 -0.403974070
## 5  0.32970586 -0.252356450 -0.352528704
## 6  0.31523511 -0.359905277 -0.025726734
## 7  0.30949609 -0.383953991  0.306658736
## 8  0.28246520 -0.191162991  0.454141280
## 9  0.24350004 -0.003865125  0.401059537
## 10 0.20458378  0.099035872  0.221135219
## 11 0.16646980  0.139019421  0.085803340
## 12 0.12715285  0.116507026  0.052966450
## 13 0.09332889 -0.169677526 -0.004089363
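For reference, a common way to obtain components like these is via the singular value decomposition of the logged mortality schedules. A minimal sketch, where log_mx_standard is a hypothetical matrix of log mortality rates with one row per state-year schedule and one column per age group (my actual code, linked above, also handles the age-group harmonization):

# SVD of the log mortality schedules;
# log_mx_standard is a hypothetical name for illustration
svd_mx <- svd(log_mx_standard)
pcs <- svd_mx$v[, 1:3]  # first three right singular vectors: age patterns
colnames(pcs) <- c("V1", "V2", "V3")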

3. Running the model

Now that we have the required data inputs, the JAGS model can be run. You need to create an input list of all the data required by JAGS, and specify the names of the parameters you would like to monitor and get posterior samples for.

jags.data <- list(y.xtas = y.xtas, 
                  pop.xtas = pop.xtas, 
                  Yx = pcs,
                  S = 1, X = length(age_groups), T = length(years), 
                  n.a = length(counties), n.amax = length(counties), P = 3)

parnames <- c("beta.tas", "mu.beta" ,"sigma.beta", "tau.mu", "u.xtas", "mx.xtas")

Once that is done, the model can be run. Please look at the model text file alongside the paper to see which variables refer to which parts of the model. The notation used in the JAGS model is (I hope) fairly consistent with the notation in the paper.

# the jags() function comes from the R2jags package
library(R2jags)

mod <- jags(data = jags.data, 
            parameters.to.save = parnames, 
            n.iter = 30000,
            model.file = "../code/model.txt")

This may take a while to run, so be patient. You can look at a summary of the model estimates like this:

mod$BUGSoutput$summary

Note that the values of all Rhats should be less than 1.1, otherwise the estimates are unreliable and should not be interpreted. If you have Rhats that are greater than 1.1, try running the model for more iterations.

# check all Rhats are less than 1.1
max(mod$BUGSoutput$summary[,"Rhat"])

4. Extracting results

Now that we have model estimates, we need to be able to extract them and look at the results. You can get the posterior samples for all parameters by extracting the sims.array from the model object:

mcmc.array <- mod$BUGSoutput$sims.array

Unless you’re interested in the underlying mechanics of the model, the quantities of most interest are probably the estimates of the age-specific mortality rates, mx.xtas. The sims.array has dimensions number of iterations (by default, 1,000 saved per chain) x number of chains (default 3) x number of parameters. So to look at the posterior samples for mx.xtas[1,1,1,1], for example, you would type:

mcmc.array[,,"mx.xtas[1,1,1,1]"]

Once the posterior samples are obtained, they can be used to compute a point estimate of the parameter (usually the posterior median) and Bayesian credible intervals. For example, a 95% credible interval is given by the 2.5th and 97.5th quantiles of the posterior samples. A chart illustrating some of the age-specific mortality estimates for six Californian counties in 2016 can be produced with the code included in CA.R.
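As a minimal sketch, here is how those summaries could be pulled out for a single rate, using the mcmc.array defined above:

samps <- c(mcmc.array[, , "mx.xtas[1,1,1,1]"])  # pool draws across chains

median(samps)                              # point estimate
quantile(samps, probs = c(0.025, 0.975))   # 95% credible interval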

Once the estimates of the mortality rates are extracted, you can also convert them into other mortality measures, such as life expectancy, using standard life table relationships. The code on GitHub includes a function that derives life expectancy from the mx’s, called derive_ex_values; this function is loaded at the beginning of CA.R, and code to generate the corresponding chart is included at the end of CA.R.
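The details are in derive_ex_values itself; purely as an illustration of the standard life-table relationships involved (and not the repository's function), a simplified sketch assuming nax = n/2 in every closed age group, which is cruder than what a careful implementation would do for the infant and open-ended groups, might look like:

# NOT the repo's derive_ex_values: a simplified life table for illustration.
# mx: vector of 13 age-specific mortality rates; ages: interval start ages
e0_from_mx <- function(mx,
                       ages = c(0, 1, seq(5, 20, 5), seq(25, 85, 10))) {
  n  <- c(diff(ages), Inf)                        # interval widths
  ax <- ifelse(is.finite(n), n / 2, 1 / mx[13])   # crude average years lived by those dying
  qx <- ifelse(is.finite(n), n * mx / (1 + (n - ax) * mx), 1)  # death probabilities
  lx <- cumprod(c(1, 1 - qx))[1:13]               # survivors at interval start
  dx <- lx * qx                                   # deaths in interval
  Lx <- ifelse(is.finite(n), n * lx - (n - ax) * dx, lx / mx)  # person-years lived
  sum(Lx) / lx[1]                                 # life expectancy at birth
}

This could be applied, for example, to the vector of posterior-median mx values for a given county and year.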

Summary

This document gives a brief introduction to the practicalities of fitting a Bayesian subnational mortality model in R using JAGS. There are many different layers to the model, and assumptions associated with it, so it is recommended that users of this code and model be familiar with the paper and the assumptions outlined in it. Good luck! :)