The basic unit-level model (Battese, Harter, and Fuller 1988; Rao and Molina 2015) is given by \[ y_j = \beta' x_j + v_{i[j]} + \epsilon_j\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) \qquad \epsilon_j \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma^2) \] where \(j\) runs from 1 to \(n\), the number of unit-level observations, \(\beta\) is a vector of regression coefficients for given covariates \(x_j\), and \(v_i\) are random area intercepts.
We use the api
dataset included in packages survey.
library(survey)
data(api)
$cname <- as.factor(apipop$cname)
apipop$cname <- factor(apisrs$cname, levels=levels(apipop$cname)) apisrs
The apipop
data frame contains the complete population whereas apisrs
is a simple random sample from it. The variable cname
is the county name, and we will be interested in estimation at the county level. Not all counties in the population are sampled. In order to be able to make predictions for out-of-sample areas we make sure that the levels of the sample’s cname
variable match those of its population counterpart.
The basic unit-level model with county random area effects is fit as follows
library(mcmcsae)
<- api00 ~
mod reg(~ ell + meals + stype + hsg + col.grad + grad.sch, name="beta") +
gen(factor = ~ iid(cname), name="v")
<- create_sampler(mod, data=apisrs)
sampler <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
sim summary(sim)) (
## llh_ :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## llh_ -1090 4.44 -246 0.127 -1097 -1089 -1083 1229 1
##
## sigma_ :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## sigma_ 56.2 3.03 18.5 0.0675 51.4 56 61.4 2018 1
##
## beta :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## (Intercept) 802.149 25.125 31.93 0.55319 760.38 802.391 844.1195 2063 0.999
## ell -1.955 0.301 -6.49 0.00683 -2.44 -1.953 -1.4649 1942 1.000
## meals -1.959 0.279 -7.03 0.00593 -2.42 -1.958 -1.5001 2206 1.000
## stypeH -106.266 13.403 -7.93 0.26613 -127.99 -106.382 -83.2523 2537 1.002
## stypeM -59.989 11.417 -5.25 0.23423 -78.75 -60.083 -41.3575 2376 1.000
## hsg -0.619 0.411 -1.51 0.00812 -1.29 -0.620 0.0513 2559 1.001
## col.grad 0.584 0.490 1.19 0.01133 -0.22 0.572 1.3960 1870 1.000
## grad.sch 2.123 0.474 4.48 0.00898 1.33 2.130 2.8878 2790 1.000
##
## v_sigma :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## v_sigma 23 6.4 3.59 0.208 13.3 22.5 34 946 1
##
## v :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda -25.3376 15.5 -1.637917 0.397 -51.4 -24.9348 -0.833 1517 1
## Amador 0.7307 24.5 0.029783 0.448 -38.4 0.5594 40.278 3000 1
## Butte 0.0213 24.1 0.000884 0.475 -39.9 0.0545 39.696 2566 1
## Calaveras 7.3363 22.1 0.332020 0.424 -27.5 6.6664 44.519 2721 1
## Colusa -0.1862 24.2 -0.007687 0.442 -40.3 -0.3438 38.628 3000 1
## Contra Costa -9.9189 15.3 -0.648433 0.344 -36.1 -8.9934 14.541 1976 1
## Del Norte 0.9561 23.8 0.040149 0.440 -37.1 0.4333 41.074 2933 1
## El Dorado -0.3595 24.2 -0.014866 0.446 -40.5 0.3054 36.991 2937 1
## Fresno 0.1117 15.3 0.007296 0.351 -25.3 -0.1095 25.895 1900 1
## Glenn -0.2126 23.4 -0.009070 0.433 -38.4 0.0831 38.587 2935 1
## ... 47 elements suppressed ...
We want to estimate the area population means \[
\theta_i = \frac{1}{N_i}\sum_{j \in U_i} y_j\,,
\] where \(U_i\) is the set of units in area \(i\) of size \(N_i\). The MCMC output in variable sim
can be used to obtain draws from the posterior distribution for \(\theta_i\). The \(r\)th draw can be expressed as \[
\theta_{i;r} = \frac{1}{N_i} \left(n_i \bar{y}_i + \beta_r'(t_{x;i} - n_i \bar{x}_i) + (N_i - n_i)v_{i;r} + \sum_{j \in U_i\setminus s_i} \epsilon_{j;r} \right)\,,
\] where \(\bar{y}_i\) is the sample mean of \(y\) in area \(i\) and \(t_{x;i}\) is a vector of population totals for area \(i\).
<- table(apipop$cname)
N <- tapply(apisrs$api00, apisrs$cname, sum)
samplesums is.na(samplesums)] <- 0 # substitute 0 for out-of-sample areas
samplesums[<- match(apisrs$cds, apipop$cds) # population units in the sample
m <- predict(sim, newdata=apipop, labels=names(N),
res fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
show.progress=FALSE)
<- summary(res)) (summ
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda 679 14.88 45.6 0.332 654 679 702 2013 1.000
## Amador 739 31.08 23.8 0.608 689 738 792 2616 1.000
## Butte 680 26.26 25.9 0.568 638 679 722 2141 1.001
## Calaveras 743 27.92 26.6 0.594 697 742 789 2208 1.000
## Colusa 552 31.31 17.6 0.617 501 553 602 2575 1.002
## Contra Costa 727 14.79 49.1 0.273 702 727 750 2933 1.000
## Del Norte 680 32.12 21.2 0.607 628 679 734 2803 0.999
## El Dorado 753 27.00 27.9 0.532 709 754 796 2572 1.001
## Fresno 595 15.13 39.3 0.276 569 595 619 3000 1.002
## Glenn 625 31.31 20.0 0.588 573 625 676 2832 1.000
## Humboldt 701 27.35 25.6 0.543 656 701 745 2536 0.999
## Imperial 558 24.50 22.8 0.506 519 557 600 2346 1.001
## Inyo 706 33.27 21.2 0.708 652 706 761 2212 1.002
## Kern 596 15.05 39.6 0.320 571 596 620 2211 1.001
## Kings 595 22.33 26.6 0.452 556 595 630 2445 1.001
## Lake 662 24.66 26.8 0.452 622 662 702 2982 1.000
## Lassen 706 26.45 26.7 0.487 663 706 750 2946 1.000
## Los Angeles 636 8.17 77.8 0.152 623 636 649 2903 0.999
## Madera 615 19.90 30.9 0.386 582 615 647 2657 1.000
## Marin 817 20.16 40.5 0.368 785 816 850 3000 1.000
## Mariposa 724 36.12 20.1 0.714 666 724 782 2560 1.001
## Mendocino 663 27.39 24.2 0.545 618 663 708 2522 1.001
## Merced 578 23.13 25.0 0.479 541 578 617 2332 1.000
## Modoc 672 29.78 22.6 0.659 624 671 722 2042 1.000
## Mono 705 41.44 17.0 0.783 641 704 773 2800 1.000
## Monterey 628 18.35 34.2 0.340 599 628 658 2912 1.001
## Napa 701 22.05 31.8 0.426 665 701 737 2682 1.001
## Nevada 799 29.10 27.5 0.563 751 799 847 2673 1.000
## Orange 693 15.26 45.4 0.315 668 692 718 2346 1.000
## Placer 772 23.89 32.3 0.489 734 772 813 2382 1.001
## Plumas 691 31.85 21.7 0.629 639 692 742 2562 1.000
## Riverside 636 13.92 45.7 0.270 613 636 658 2652 1.000
## Sacramento 668 14.93 44.8 0.286 644 669 693 2721 1.001
## San Benito 708 29.43 24.1 0.574 661 708 757 2633 1.000
## San Bernardino 647 12.78 50.6 0.233 626 646 668 3000 1.000
## San Diego 704 14.18 49.6 0.320 681 704 728 1967 0.999
## San Francisco 638 19.87 32.1 0.402 607 638 671 2441 1.000
## San Joaquin 630 17.53 35.9 0.368 601 631 658 2266 1.000
## San Luis Obispo 752 23.14 32.5 0.435 715 752 791 2824 0.999
## San Mateo 734 20.99 35.0 0.405 700 734 769 2681 1.000
## Santa Barbara 679 19.35 35.1 0.390 647 679 711 2461 1.000
## Santa Clara 733 15.96 45.9 0.292 705 734 759 2985 1.000
## Santa Cruz 679 19.84 34.3 0.379 646 680 712 2736 1.000
## Shasta 697 21.75 32.0 0.466 663 696 733 2184 1.001
## Sierra 707 42.43 16.7 0.775 637 707 777 3000 1.001
## Siskiyou 698 24.98 27.9 0.456 657 698 738 3000 1.000
## Solano 711 19.97 35.6 0.391 678 711 743 2612 1.000
## Sonoma 725 22.54 32.2 0.443 687 726 761 2583 1.000
## Stanislaus 662 19.57 33.8 0.357 629 661 694 3000 1.003
## Sutter 653 24.32 26.8 0.458 613 653 693 2815 1.000
## Tehama 659 28.99 22.7 0.539 611 659 706 2893 1.000
## Trinity 649 38.10 17.0 0.696 587 651 711 3000 1.001
## Tulare 583 21.90 26.6 0.440 546 584 618 2476 1.000
## Tuolumne 720 30.42 23.7 0.582 671 719 771 2731 1.000
## Ventura 710 18.03 39.4 0.342 680 709 740 2783 1.000
## Yolo 686 24.84 27.6 0.472 645 686 726 2773 1.000
## Yuba 627 28.28 22.2 0.606 581 627 673 2176 1.001
<- c(tapply(apipop$api00, apipop$cname, mean)) # true population quantities
theta plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)
A model with binomial likelihood can also be fit. We now model the target variable sch.wide
, a binary variable indicating whether a school-wide growth target has been met. We use the same mean model structure as above for the linear model, but now using a logistic link function, \[
y_j \stackrel{\mathrm{iid}}{\sim} {\cal Be}(p_j)\,,\\
\mathrm{logit}(p_j) = \beta' x_j + v_{i[j]}\,,\\
v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2)
\]
$target.met <- as.numeric(apisrs$sch.wide == "Yes")
apisrs<- create_sampler(update(mod, target.met ~ .), family="binomial", data=apisrs)
sampler <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
sim summary(sim)
## llh_ :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## llh_ -83.3 2.49 -33.4 0.0607 -87.5 -83.2 -79.4 1685 1
##
## beta :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## (Intercept) 4.563445 1.4934 3.0558 0.058914 2.2238 4.480713 7.14145 643 1.00
## ell -0.035823 0.0150 -2.3838 0.000420 -0.0612 -0.035212 -0.01215 1280 1.00
## meals 0.000808 0.0141 0.0572 0.000454 -0.0223 0.000659 0.02462 971 1.00
## stypeH -2.821852 0.6210 -4.5441 0.019304 -3.8534 -2.800611 -1.85445 1035 1.00
## stypeM -1.651438 0.5873 -2.8118 0.016695 -2.5977 -1.644618 -0.71640 1238 1.00
## hsg -0.025548 0.0223 -1.1470 0.000657 -0.0626 -0.025645 0.01055 1148 1.00
## col.grad -0.037802 0.0269 -1.4045 0.000868 -0.0816 -0.037805 0.00687 961 1.01
## grad.sch 0.014016 0.0316 0.4434 0.001113 -0.0347 0.012320 0.06888 807 1.00
##
## v_sigma :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## v_sigma 0.375 0.286 1.31 0.0123 0.0297 0.32 0.917 535 1.01
##
## v :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda -0.190148 0.433 -0.43866 0.01363 -1.000 -0.07204 0.338 1012 1.003
## Amador 0.004611 0.489 0.00943 0.00915 -0.771 0.00267 0.762 2856 1.000
## Butte -0.000808 0.478 -0.00169 0.00872 -0.759 -0.00202 0.785 3000 0.999
## Calaveras 0.035399 0.465 0.07609 0.00914 -0.675 0.00547 0.845 2588 0.999
## Colusa -0.010062 0.462 -0.02176 0.00846 -0.797 -0.00139 0.725 2991 1.000
## Contra Costa 0.005358 0.404 0.01327 0.00940 -0.666 0.00159 0.674 1846 1.002
## Del Norte -0.007185 0.479 -0.01501 0.00887 -0.792 -0.00109 0.734 2910 1.000
## El Dorado 0.004596 0.451 0.01019 0.00838 -0.729 0.00206 0.721 2896 1.000
## Fresno -0.042824 0.383 -0.11190 0.00892 -0.694 -0.01627 0.544 1841 1.002
## Glenn -0.014269 0.476 -0.02995 0.00871 -0.804 -0.00148 0.730 2994 1.000
## ... 47 elements suppressed ...
To predict the population fractions of schools that meet the growth target by county,
<- tapply(apisrs$target.met, apisrs$cname, sum)
samplesums is.na(samplesums)] <- 0 # substitute 0 for out-of-sample areas
samplesums[<- predict(sim, newdata=apipop, labels=names(N),
res fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
show.progress=FALSE)
<- summary(res)) (summ
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda 0.784 0.0615 12.74 0.001650 0.670 0.792 0.864 1390 1.002
## Amador 0.837 0.1171 7.14 0.002182 0.600 0.800 1.000 2880 1.000
## Butte 0.835 0.0755 11.06 0.001548 0.708 0.833 0.938 2377 1.000
## Calaveras 0.875 0.0953 9.18 0.001884 0.700 0.900 1.000 2558 1.000
## Colusa 0.645 0.1577 4.09 0.002951 0.333 0.667 0.889 2856 1.001
## Contra Costa 0.817 0.0570 14.32 0.001349 0.721 0.821 0.899 1786 1.001
## Del Norte 0.881 0.1079 8.16 0.002191 0.750 0.875 1.000 2424 1.001
## El Dorado 0.850 0.0758 11.23 0.001521 0.725 0.850 0.950 2482 1.002
## Fresno 0.781 0.0618 12.65 0.001463 0.667 0.790 0.871 1783 1.001
## Glenn 0.790 0.1415 5.58 0.002717 0.556 0.778 1.000 2713 1.002
## Humboldt 0.850 0.0746 11.39 0.001533 0.725 0.850 0.950 2369 1.000
## Imperial 0.699 0.1059 6.60 0.002309 0.500 0.700 0.850 2103 1.002
## Inyo 0.777 0.1584 4.91 0.003087 0.571 0.857 1.000 2635 1.000
## Kern 0.804 0.0523 15.36 0.001232 0.706 0.811 0.878 1805 1.000
## Kings 0.850 0.0791 10.74 0.001654 0.720 0.840 0.960 2289 1.001
## Lake 0.825 0.0957 8.62 0.001837 0.636 0.818 0.955 2717 1.002
## Lassen 0.834 0.1100 7.58 0.002184 0.636 0.818 1.000 2534 1.000
## Los Angeles 0.799 0.0431 18.56 0.001040 0.728 0.799 0.869 1714 1.000
## Madera 0.820 0.0753 10.89 0.001539 0.677 0.839 0.935 2391 1.001
## Marin 0.837 0.0707 11.84 0.001705 0.700 0.840 0.940 1721 1.002
## Mariposa 0.860 0.1517 5.67 0.003097 0.600 0.800 1.000 2400 1.002
## Mendocino 0.815 0.0905 9.01 0.001700 0.640 0.840 0.960 2834 0.999
## Merced 0.787 0.0826 9.53 0.001727 0.635 0.794 0.905 2289 1.001
## Modoc 0.832 0.1578 5.27 0.002998 0.600 0.800 1.000 2769 0.999
## Mono 0.748 0.2435 3.07 0.004572 0.333 0.667 1.000 2837 1.000
## Monterey 0.803 0.0682 11.78 0.001688 0.687 0.807 0.916 1633 1.001
## Napa 0.845 0.0790 10.70 0.001538 0.704 0.852 0.963 2635 1.000
## Nevada 0.866 0.0961 9.01 0.001865 0.714 0.857 1.000 2655 0.999
## Orange 0.774 0.0611 12.68 0.001364 0.672 0.775 0.873 2004 1.000
## Placer 0.813 0.0761 10.68 0.001723 0.682 0.818 0.909 1950 0.999
## Plumas 0.738 0.1558 4.74 0.002921 0.444 0.778 1.000 2844 0.999
## Riverside 0.824 0.0527 15.64 0.001133 0.729 0.831 0.895 2163 1.000
## Sacramento 0.847 0.0493 17.20 0.001286 0.764 0.847 0.924 1467 1.001
## San Benito 0.816 0.1262 6.47 0.002513 0.545 0.818 1.000 2522 1.000
## San Bernardino 0.862 0.0422 20.43 0.001159 0.793 0.862 0.934 1326 1.005
## San Diego 0.848 0.0427 19.85 0.000988 0.775 0.850 0.913 1870 1.002
## San Francisco 0.761 0.0740 10.28 0.001524 0.630 0.770 0.870 2359 0.999
## San Joaquin 0.827 0.0576 14.36 0.001311 0.726 0.831 0.911 1932 1.001
## San Luis Obispo 0.854 0.0731 11.69 0.001469 0.725 0.850 0.950 2474 0.999
## San Mateo 0.796 0.0746 10.67 0.001915 0.664 0.804 0.888 1519 1.001
## Santa Barbara 0.828 0.0669 12.39 0.001645 0.716 0.840 0.926 1652 1.001
## Santa Clara 0.782 0.0729 10.73 0.002138 0.645 0.796 0.871 1162 1.000
## Santa Cruz 0.787 0.0753 10.45 0.001667 0.653 0.796 0.898 2044 1.000
## Shasta 0.816 0.0772 10.58 0.001726 0.675 0.825 0.925 1998 1.000
## Sierra 0.723 0.2276 3.18 0.004201 0.333 0.667 1.000 2935 1.000
## Siskiyou 0.842 0.1004 8.39 0.001970 0.667 0.867 1.000 2595 0.999
## Solano 0.866 0.0634 13.67 0.001374 0.750 0.867 0.967 2125 1.002
## Sonoma 0.846 0.0620 13.64 0.001423 0.741 0.852 0.935 1900 1.000
## Stanislaus 0.812 0.0738 11.00 0.001822 0.681 0.824 0.901 1642 1.004
## Sutter 0.832 0.0904 9.20 0.001722 0.650 0.850 0.950 2758 1.000
## Tehama 0.842 0.1004 8.38 0.002095 0.647 0.882 1.000 2297 1.001
## Trinity 0.745 0.2051 3.63 0.004003 0.500 0.750 1.000 2624 1.000
## Tulare 0.831 0.0667 12.45 0.001492 0.709 0.836 0.927 2002 1.000
## Tuolumne 0.878 0.0968 9.07 0.002004 0.746 0.917 1.000 2333 0.999
## Ventura 0.840 0.0493 17.03 0.001061 0.758 0.845 0.913 2159 1.000
## Yolo 0.853 0.0736 11.59 0.001598 0.714 0.857 0.971 2125 1.001
## Yuba 0.793 0.1014 7.82 0.002037 0.632 0.789 0.947 2477 1.000
<- c(tapply(apipop$sch.wide == "Yes", apipop$cname, mean)) # true population quantities
theta plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)