This vignette demonstrates three major functions in the brokenstick
package: brokenstick()
, predict()
and plot()
. We also need dplyr
and ggplot2
.
require("brokenstick")
require("dplyr")
library("ggplot2")
For more elaborate documentation, visit https://growthcharts.org/brokenstick/articles/manual/manual.html.
The smocc_200
data in the brokenstick
package contain the heights of 200 Dutch children measured on 10 visits at ages 0-2 years.
<- brokenstick::smocc_200
data head(data, 3)
## # A tibble: 3 × 7
## id age sex ga bw hgt hgt_z
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 10001 0 female 40 3960 52 0.575
## 2 10001 0.0821 female 40 3960 55.6 0.888
## 3 10001 0.159 female 40 3960 58.2 0.797
Figure 1 dispays the data from the first 500 rows as a set of growth curves of Dutch children. Curves are steeper during the first few months, so child growth is faster for young infants. Note also there are more cross-overs during the first half year, whereas fewer occur later. This means that the relative positions have been settled by the age of 2 years, or - put differently - that the correlation between time points at those ages is high.
ggplot(data[1:500, ], aes(x = age, y = hgt_z, group = id, color = as.factor(id))) +
geom_line(size = 0.1) +
geom_point(size = 0.7) +
scale_colour_viridis_d(option = "cividis") +
xlab("Age (years)") +
ylab("Length SDS") +
theme_light() +
theme(legend.position = "none")
Figure 2 dispays the same data, but with the vertical axis changed to Standard Deviation Scores (SDS), or \(Z\)-score. The \(Z\)-score is the height corrected for age relative to the Dutch height reference from the Fourth Dutch Growth Study. The \(Z\)-score transformation takes away the major time trend, so all curves are more or less flat. This allows us to see a more detailed assessment of individual growth.
The plots also show how the measurements are clustered around ten ages: birth, 1, 2, 3, 6, 9, 12, 15, 18 and 24 months. While the design was followed rigorously in the study, some variation in timing is inevitable because of weekends, holidays, sickness, and other events. The timing variation poses a problem because we cannot directly compare the measurement between different children (especially for figure 1). Also, we cannot easily construct the “broad” matrix with 10 time point per child.
Of course, we can divide the time axis into ten age groups, and treat all point within the same age group as being measured at the same point. This is probably a good strategy for nicely looking data - as we have here -, but this approach is problematic in data with irregular time intervals, of there are multiple measurement per age group, if the measurement schedules vary by child, or in data combined from studies that employed different designs.
The brokenstick
package contains tools to approximate the observed data by a series of connecting straight lines. When these lines closely follow the data, we may replace each trajectory by its values at the breakpoints. The statistical analysis can then be done on the regularised trajectories, which is easier than working with the observed data.
We fit a trivial broken stick model with just one line anchored at the minimum and maximum age, and plot the trajectories of three selected children as follows:
set.seed(123)
<- brokenstick(hgt ~ age | id, data)
fit <- c(10001, 10005, 10022)
ids plot(fit, group = ids, what = "all",
xlab = "Age (years)", ylab = "Length (cm)")
The following plot displays the same data, but in standardised units so as to increase the analytic resolution:
<- brokenstick(hgt_z ~ age | id, data)
fit0 plot(fit0, group = ids, what = "all",
xlab = "Age (years)", ylab = "Length (SDS)")
Note that both approximations are quite bad.
The broken stick model describes a trajectory by a series of connected straight lines. We first calculate a model with two connected lines. The first line starts at birth and end at the age of exactly 1 years. The second line spans the period between 1 to 2 years. In addition, the lines must connect at the age of 1 year. We estimate and plot the model as follows:
<- brokenstick(hgt_z ~ age | id, data = data, knots = 0:3)
fit2 plot(fit2, group = ids, xlab = "Age (years)", ylab = "Length (SDS)")
The fit2
object holds the parameter estimates of the model:
fit2
## Class brokenstick (kr)
## Variables hgt_z (outcome), age (predictor), id (group)
## Knots 0 1 2 3
## Mean resid 0.1821095
## R-squared 0.8667319
The console output lists the knots of the model, including the left and right boundary knots at 0 and 3. The row of means
correspond to the fixed effect estimates of the linear mixed model. We may interpret these as the global means. Next, the output lists the variance-covariance matrix of the random effects. The model contains 4 random effects (3 visits + 1 end knot). Finally, the residual variance is the model error at the subject level. These three set of parameters are enough to reconstruct the broken stick model.
The plot shows that the two-line model is still fairly crude. We refine the model in the first two years by adding a knot for each age at which a visit was scheduled. This model can be run as
<- round(c(0, 1, 2, 3, 6, 9, 12, 15, 18, 24)/12, 4)
knots <- brokenstick(hgt_z ~ age | id, data = data,
fit9 knots = knots, boundary = c(0, 3))
This optimization problem is more difficult, so it takes longer to run. In addition, it is common that the optimization routines issue a number of warnings related to the number of random effects relative to the number of observations. While these may appear a little discomforting, we have found that the warnings are generally somewhat at the conservative side, so the parameter estimates are probably still OK.
The nine-line broken stick model fits the observed data very well.
The predict()
function allows us to obtain various types of predictions from the broken stick model. The simplest call
<- predict(fit2)
p1 head(p1)
## .pred
## 1 0.7183916
## 2 0.6590848
## 3 0.6036788
## 4 0.5344755
## 5 0.3544602
## 6 0.1745172
identical(nrow(data), nrow(p1))
## [1] TRUE
produces a tibble
with the one column called .pred
for each row in data
. We can bind column .pred
to data
for further processing.
Sometimes, we also want the prediction at the knot values, for example, to create graphs that contain observed and modelled trajectories. We obtain predictions at the knots by the special x = "knots"
argument, e.g.
<- predict(fit2, x = "knots")
p2 head(p2)
## .source id age sex ga bw hgt hgt_z .pred
## 1 added 10001 0 <NA> NA NA NA NA 0.718391617
## 2 added 10001 1 <NA> NA NA NA NA -0.003981157
## 3 added 10001 2 <NA> NA NA NA NA 0.042961066
## 4 added 10001 3 <NA> NA NA NA NA 0.478864260
## 5 added 10002 0 <NA> NA NA NA NA -0.230643864
## 6 added 10002 1 <NA> NA NA NA NA -0.381720863
nrow(p2)
## [1] 800
The output is more verbose and includes the grid of knots for each child (id
, age
). The column .source
is equal to added
as rows are added to the data.
Note there are also knots at ages 0.00 and 3.00 years. These are boundary knots, and added by the brokenstick()
function. By default, the boundary knots span the age range in the data. The estimate for the knot at the maximum age has no useful interpretation, and should be ignored.
If we wish to obtain estimates at both the knots and the observed data use:
<- predict(fit2, x = "knots", strip_data = FALSE)
p3 table(p3$.source)
##
## added data
## 800 1942
This return 1942 rows for the data and 800 rows for the knots.
The proportion of the variance of the outcome explained by the two-line model is
get_r2(fit2)
## [1] 0.8667319
For the second model we get
get_r2(fit9)
## [1] 0.9846596
so the nine-line broken stick model explains about 98 percent of the variance of the height SDS.
Suppose we are interest in knowing the effect of sex, gestational age and birth weight on the height SDS at the age of 2 years. This is an analysis at the subject level. Let us first extract the subject-level data with variables that vary over subjects only.
<- data %>%
subj select(id, sex, ga, bw) %>%
group_by(id) %>%
slice(1)
head(subj, 3)
## # A tibble: 3 × 4
## # Groups: id [3]
## id sex ga bw
## <dbl> <chr> <dbl> <dbl>
## 1 10001 female 40 3960
## 2 10002 male 38 3210
## 3 10003 female 40 4170
We also need the outcome variable. We take it from the broken stick estimates from the nine line solution and append it to the subject level data.
<- predict(fit9, x = "knots", shape = "wide")
bs <- bind_cols(subj, select(bs, -id))
data head(data, 3)
## # A tibble: 3 × 15
## # Groups: id [3]
## id sex ga bw `0` `0.0833` `0.1667` `0.25` `0.5` `0.75` `1`
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10001 female 40 3960 0.571 0.871 0.737 0.662 0.233 -0.191 0.130
## 2 10002 male 38 3210 -0.182 -0.330 -0.254 -0.320 -0.216 -0.279 -0.341
## 3 10003 female 40 4170 1.16 2.12 1.98 2.03 2.04 1.83 1.33
## # … with 4 more variables: `1.25` <dbl>, `1.5` <dbl>, `2` <dbl>, `3` <dbl>
The names of the columns in bs
correspond to the knot values.
The effect of the subject’s sex, gestational age and birth weight on the height SDS at the age of 2 years (here denoted by the variable named 2
) can be estimated as
<- lm(`2` ~ sex + ga + I(bw / 1000), data = data)
fit1_lm summary(fit1_lm)
##
## Call:
## lm(formula = `2` ~ sex + ga + I(bw/1000), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.96585 -0.46305 0.06656 0.63552 2.05131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.30186 1.48019 0.880 0.38020
## sexmale 0.01451 0.12930 0.112 0.91078
## ga -0.07418 0.04359 -1.702 0.09038 .
## I(bw/1000) 0.50933 0.14295 3.563 0.00046 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8849 on 196 degrees of freedom
## Multiple R-squared: 0.06722, Adjusted R-squared: 0.05294
## F-statistic: 4.708 on 3 and 196 DF, p-value: 0.003386
Note that the analysis shows there is a substantial effect of birth weight. Of course, it might be that birth weight is directly related to height at the age of 2 years. Alternatively, the relation could be mediated by birth length. The following model adds birth length (the variable named 0
) to the model:
<- lm(`2` ~ sex + ga + I(bw / 1000) + `0`, data = data)
fit2_lm summary(fit2_lm)
##
## Call:
## lm(formula = `2` ~ sex + ga + I(bw/1000) + `0`, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.00784 -0.50862 0.05078 0.62563 1.89610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.427186 1.673835 2.048 0.0419 *
## sexmale 0.003071 0.127536 0.024 0.9808
## ga -0.093100 0.043585 -2.136 0.0339 *
## I(bw/1000) 0.131026 0.202911 0.646 0.5192
## `0` 0.236379 0.091229 2.591 0.0103 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8723 on 195 degrees of freedom
## Multiple R-squared: 0.09826, Adjusted R-squared: 0.07976
## F-statistic: 5.312 on 4 and 195 DF, p-value: 0.0004414
The effect of birth length on length at age 2 is very strong. There is no separate effect of birth weight anymore, so this analysis suggests that the relation between birth weight and length at age 2 can be explained by their mutual associations to birth length.
This vignette illustrated the use of the brokenstick()
, plot()
and predict()
functions. Other vignettes highlight various other capabilities of the package.