Exercise 6. Diet data: tabulating incidence rates and modelling with Poisson regression
You may have to install the required packages the first time you use them. You can install a package by install.packages("package_of_interest")
for each package you require.
library(biostat3)
library(dplyr) # for data manipulation
Load the diet data using time-on-study as the timescale.
head(diet)
## id chd y hieng energy job month height weight
## 1 127 0 16.791239 low 2023.25 conductor 2 173.9900 61.46280
## 2 200 0 19.958933 low 2448.68 bank 12 177.8000 73.48320
## 3 198 0 19.958933 low 2281.38 bank 12 NA NA
## 4 222 0 15.394935 low 2467.95 bank 2 158.7500 58.24224
## 5 305 1 1.494866 low 2362.93 bank 1 NA NA
## 6 173 0 15.958932 low 2380.11 conductor 12 164.4904 79.01712
## doe dox dob yoe yox yob y1k
## 1 1960-02-16 1976-12-01 1910-09-27 1960.126 1976.917 1910.737 0.016791239
## 2 1956-12-16 1976-12-01 1909-06-18 1956.958 1976.917 1909.460 0.019958933
## 3 1956-12-16 1976-12-01 1910-06-30 1956.958 1976.917 1910.493 0.019958933
## 4 1957-02-16 1972-07-10 1902-07-11 1957.126 1972.523 1902.523 0.015394935
## 5 1960-01-16 1961-07-15 1913-06-30 1960.041 1961.534 1913.493 0.001494866
## 6 1960-12-16 1976-12-01 1915-06-28 1960.958 1976.917 1915.487 0.015958932
## eng3 X1 X2 X3 eng3low eng3medium eng3high
## 1 low 1 0 0 1 0 0
## 2 low 1 0 0 1 0 0
## 3 low 1 0 0 1 0 0
## 4 low 1 0 0 1 0 0
## 5 low 1 0 0 1 0 0
## 6 low 1 0 0 1 0 0
summary(diet)
## id chd y hieng
## Min. : 1 Min. :0.0000 Min. : 0.2875 low :155
## 1st Qu.: 85 1st Qu.:0.0000 1st Qu.:10.7762 high:182
## Median :169 Median :0.0000 Median :15.4606
## Mean :169 Mean :0.1365 Mean :13.6607
## 3rd Qu.:253 3rd Qu.:0.0000 3rd Qu.:17.0431
## Max. :337 Max. :1.0000 Max. :20.0411
##
## energy job month height
## Min. :1748 driver :102 Min. : 1.000 Min. :152.4
## 1st Qu.:2537 conductor: 84 1st Qu.: 3.000 1st Qu.:168.9
## Median :2803 bank :151 Median : 6.000 Median :173.0
## Mean :2829 Mean : 6.231 Mean :173.4
## 3rd Qu.:3110 3rd Qu.:10.000 3rd Qu.:177.8
## Max. :4396 Max. :12.000 Max. :190.5
## NA's :5
## weight doe dox
## Min. : 46.72 Min. :1956-11-16 Min. :1958-08-29
## 1st Qu.: 64.64 1st Qu.:1959-01-16 1st Qu.:1972-09-29
## Median : 72.80 Median :1960-02-16 Median :1976-12-01
## Mean : 72.54 Mean :1960-06-22 Mean :1974-02-19
## 3rd Qu.: 79.83 3rd Qu.:1961-06-16 3rd Qu.:1976-12-01
## Max. :106.14 Max. :1966-09-16 Max. :1976-12-01
## NA's :4
## dob yoe yox yob
## Min. :1892-01-10 Min. :1957 Min. :1959 Min. :1892
## 1st Qu.:1906-01-18 1st Qu.:1959 1st Qu.:1973 1st Qu.:1906
## Median :1911-02-25 Median :1960 Median :1977 Median :1911
## Mean :1911-01-04 Mean :1960 Mean :1974 Mean :1911
## 3rd Qu.:1915-01-30 3rd Qu.:1961 3rd Qu.:1977 3rd Qu.:1915
## Max. :1930-09-19 Max. :1967 Max. :1977 Max. :1931
##
## y1k eng3 X1 X2
## Min. :0.0002875 low : 75 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0107762 medium:150 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0154606 high :112 Median :0.0000 Median :0.0000
## Mean :0.0136607 Mean :0.2226 Mean :0.4451
## 3rd Qu.:0.0170431 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :0.0200411 Max. :1.0000 Max. :1.0000
##
## X3 eng3low eng3medium eng3high
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.3323 Mean :0.2226 Mean :0.4451 Mean :0.3323
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
(a)
diet <- biostat3::diet
diet$y1k <- diet$y/1000
diet.ir6a <- survRate(Surv(y/1000,chd) ~ hieng, data=diet)
## or
diet %>%
group_by(hieng) %>%
summarise(Event = sum(chd), Time = sum(y1k), Rate = Event/Time, # group sums
CI_low = poisson.test(Event,Time)$conf.int[1],
CI_high = poisson.test(Event,Time)$conf.int[2])
## # A tibble: 2 x 6
## hieng Event Time Rate CI_low CI_high
## <fctr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 low 28 2.059431 13.595992 9.034438 19.64999
## 2 high 18 2.544238 7.074809 4.192980 11.18125
## IRR
with(diet.ir6a, poisson.test(event,tstop))
##
## Comparison of Poisson rates
##
## data: event time base: tstop
## count1 = 28, expected count1 = 20.578, p-value = 0.03681
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
## 1.026173 3.688904
## sample estimates:
## rate ratio
## 1.921747
with(diet.ir6a, poisson.test(rev(event),rev(tstop)))
##
## Comparison of Poisson rates
##
## data: rev(event) time base: rev(tstop)
## count1 = 18, expected count1 = 25.422, p-value = 0.03681
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
## 0.2710832 0.9744943
## sample estimates:
## rate ratio
## 0.5203599
We see that individuals with a high energy intake have a lower CHD incidence rate. The estimated crude incidence rate ratio is 0.52 (95% CI: 0.27, 0.97).
(b)
poisson6b <- glm( chd ~ hieng + offset( log( y1k ) ), family=poisson, data=diet)
summary(poisson6b)
##
## Call:
## glm(formula = chd ~ hieng + offset(log(y1k)), family = poisson,
## data = diet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7382 -0.6337 -0.4899 -0.3891 3.0161
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.6098 0.1890 13.811 <2e-16 ***
## hienghigh -0.6532 0.3021 -2.162 0.0306 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 262.82 on 336 degrees of freedom
## Residual deviance: 258.00 on 335 degrees of freedom
## AIC: 354
##
## Number of Fisher Scoring iterations: 6
eform(poisson6b)
## exp(beta) 2.5 % 97.5 %
## (Intercept) 13.5959916 9.3877130 19.6907371
## hienghigh 0.5203599 0.2878432 0.9407012
eform(poisson6b, method="Profile")
## Waiting for profiling to be done...
## exp(beta) 2.5 % 97.5 %
## (Intercept) 13.5959916 9.1614552 19.2715805
## hienghigh 0.5203599 0.2829171 0.9328392
The point estimate for the IRR calculated by the Poisson regression is the same as the IRR calculated in 6(a). A theoretical observation: if we consider the data as being cross-classified solely by hieng
then the Poisson regression model with one parameter is a saturated model so the IRR estimated from the model will be identical to the ‘observed’ IRR. That is, the model is a perfect fit.
(c)
hist6c <- hist(diet$energy, breaks=25, probability=TRUE, xlab="Energy (units)")
curve(dnorm(x, mean=mean(diet$energy), sd=sd(diet$energy)), col = "red", add=TRUE)
quantile(diet$energy, probs=c(0.01,0.05,0.1,0.25,0.5,0.75,0.90,0.95,0.99))
## 1% 5% 10% 25% 50% 75% 90% 95%
## 1887.268 2177.276 2314.114 2536.690 2802.980 3109.660 3365.644 3588.178
## 99%
## 4046.820
# For kurtosis and skewness, see package e1071
The histogram gives us an idea of the distribution of energy intake. We can also tabulate moments and percentiles of the distribution.
(d)
diet$eng3 <- cut(diet$energy, breaks=c(1500,2500,3000,4500),labels=c("low","medium","high"),
right = FALSE)
cbind(Freq=table(diet$eng3),
Prop=table(diet$eng3)/nrow(diet))
## Freq Prop
## low 75 0.2225519
## medium 150 0.4451039
## high 112 0.3323442
(e)
## rates and IRRs
diet.ir6e <- survRate(Surv(y/1000,chd) ~ eng3, data=diet)
print(diet.ir6e)
## eng3 tstop event rate lower upper
## eng3=low low 0.9466338 16 16.901995 9.660951 27.447781
## eng3=medium medium 2.0172621 22 10.905871 6.834651 16.511619
## eng3=high high 1.6397728 8 4.878725 2.106287 9.613033
# calculate IRR and confidence intervals
with(diet.ir6e, rate[eng3=="medium"] / rate[eng3=="low"])
## [1] 0.6452416
with(diet.ir6e[c(2,1),], { # compare second row with first row
poisson.test(event, tstop)
})
##
## Comparison of Poisson rates
##
## data: event time base: tstop
## count1 = 22, expected count1 = 25.863, p-value = 0.2221
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
## 0.3237007 1.3143509
## sample estimates:
## rate ratio
## 0.6452416
with(diet.ir6e, rate[eng3=="high"] / rate[eng3=="low"])
## [1] 0.2886479
with(diet.ir6e[c(3,1),], { # compare third row with first row
poisson.test(event, tstop)
})
##
## Comparison of Poisson rates
##
## data: event time base: tstop
## count1 = 8, expected count1 = 15.216, p-value = 0.004579
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
## 0.1069490 0.7148284
## sample estimates:
## rate ratio
## 0.2886479
We see that the CHD incidence rate decreases as the level of total energy intake increases.
(f)
diet <- mutate(diet,
X1 = as.numeric(eng3 == "low"),
X2 = as.numeric(eng3 == "medium"),
X3 = as.numeric(eng3 == "high"))
# or
diet <- biostat3::addIndicators(diet, ~eng3+0) %>%
mutate(X1 = eng3low, X2 = eng3medium, X3 = eng3high)
colSums(diet[c("X1","X2","X3")])
## X1 X2 X3
## 75 150 112
(g)
filter(diet, eng3=="low") %>% select(c(energy,eng3,X1,X2,X3)) %>% head
## energy eng3 X1 X2 X3
## 1 2023.25 low 1 0 0
## 2 2448.68 low 1 0 0
## 3 2281.38 low 1 0 0
## 4 2467.95 low 1 0 0
## 5 2362.93 low 1 0 0
## 6 2380.11 low 1 0 0
filter(diet, eng3=="medium") %>% select(c(energy,eng3,X1,X2,X3)) %>% head
## energy eng3 X1 X2 X3
## 1 2664.64 medium 0 1 0
## 2 2533.33 medium 0 1 0
## 3 2854.08 medium 0 1 0
## 4 2673.77 medium 0 1 0
## 5 2766.88 medium 0 1 0
## 6 2586.69 medium 0 1 0
filter(diet, eng3=="high") %>% select(c(energy,eng3,X1,X2,X3)) %>% head
## energy eng3 X1 X2 X3
## 1 3067.36 high 0 0 1
## 2 3298.95 high 0 0 1
## 3 3147.60 high 0 0 1
## 4 3180.47 high 0 0 1
## 5 3045.81 high 0 0 1
## 6 3060.03 high 0 0 1
(h)
poisson6h <- glm( chd ~ X2 + X3 + offset( log( y1k ) ), family=poisson, data=diet )
summary(poisson6h)
##
## Call:
## glm(formula = chd ~ X2 + X3 + offset(log(y1k)), family = poisson,
## data = diet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8231 -0.6052 -0.4532 -0.3650 2.9434
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.8274 0.2500 11.312 < 2e-16 ***
## X2 -0.4381 0.3285 -1.334 0.18233
## X3 -1.2425 0.4330 -2.870 0.00411 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 262.82 on 336 degrees of freedom
## Residual deviance: 253.62 on 334 degrees of freedom
## AIC: 351.62
##
## Number of Fisher Scoring iterations: 6
eform(poisson6h)
## exp(beta) 2.5 % 97.5 %
## (Intercept) 16.9019960 10.3556032 27.5867534
## X2 0.6452416 0.3389050 1.2284761
## X3 0.2886478 0.1235407 0.6744143
Level 1 of the categorized total energy is the reference category. The estimated rate ratio comparing level 2 to level 1 is 0.6452 and the estimated rate ratio comparing level 3 to level 1 is 0.2886.
(i)
poisson6i <- glm( chd ~ X1 + X3 + offset( log( y1k ) ), family=poisson, data=diet )
# or
poisson6i <- glm( chd ~ I(eng3=="low") + I(eng3=="high") + offset( log( y1k ) ), family=poisson, data=diet )
summary( poisson6i )
##
## Call:
## glm(formula = chd ~ I(eng3 == "low") + I(eng3 == "high") + offset(log(y1k)),
## family = poisson, data = diet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8231 -0.6052 -0.4532 -0.3650 2.9434
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.3893 0.2132 11.207 <2e-16 ***
## I(eng3 == "low")TRUE 0.4381 0.3285 1.334 0.1823
## I(eng3 == "high")TRUE -0.8044 0.4129 -1.948 0.0514 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 262.82 on 336 degrees of freedom
## Residual deviance: 253.62 on 334 degrees of freedom
## AIC: 351.62
##
## Number of Fisher Scoring iterations: 6
eform( poisson6i )
## exp(beta) 2.5 % 97.5 %
## (Intercept) 10.9058706 7.1810131 16.562846
## I(eng3 == "low")TRUE 1.5498071 0.8140167 2.950679
## I(eng3 == "high")TRUE 0.4473485 0.1991681 1.004783
Now use level 2 as the reference (by omitting X2 but including X1 and X3). The estimated rate ratio comparing level 1 to level 2 is 1.5498 and the estimated rate ratio comparing level 3 to level 2 is 0.4473.
(j)
poisson6j <- glm( chd ~ eng3 + offset( log( y1k ) ), family=poisson, data=diet )
summary( poisson6j )
##
## Call:
## glm(formula = chd ~ eng3 + offset(log(y1k)), family = poisson,
## data = diet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8231 -0.6052 -0.4532 -0.3650 2.9434
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.8274 0.2500 11.312 < 2e-16 ***
## eng3medium -0.4381 0.3285 -1.334 0.18233
## eng3high -1.2425 0.4330 -2.870 0.00411 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 262.82 on 336 degrees of freedom
## Residual deviance: 253.62 on 334 degrees of freedom
## AIC: 351.62
##
## Number of Fisher Scoring iterations: 6
eform( poisson6j )
## exp(beta) 2.5 % 97.5 %
## (Intercept) 16.9019960 10.3556032 27.5867534
## eng3medium 0.6452416 0.3389050 1.2284761
## eng3high 0.2886478 0.1235407 0.6744143
The estimates are identical (as we would hope) when we have R create indicator variables for us.
(k)
Somehow (there are many different alternatives) you’ll need to calculate the total number of events and the total person-time at risk and then calculate the incidence rate as events/person-time. For example,
summarise(diet, rate = sum(chd) / sum(y))
## rate
## 1 0.009992031
The estimated incidence rate is 0.00999 events per person-year.