The following is a similar vignette - “How to use rdhs?”. However, this one achieves the same results using the rdhs
client object directly to download datasets etc, rather than using the easier user interfacing functions. This vignette is included to help demonstrate how the pacakge internals are working and adds more information about how the package was designed, which may be useful for anyone adding extra functionality to rdhs
in the future.
If are not bothered by this then please read the main introductory vignette.
rdhs
is a package for management and analysis of Demographic and Health Survey (DHS) data. This includes functionality to:
This process is described below and should cover most functionality that will be needed for working with these datasets.
Install rdhs from github with devtools
:
# install.packages("devtools")
# devtools::install_github("ropensci/rdhs")
library(rdhs)
The DHS API has many endpoints that can be accessed using anyone of dhs_<endpoint>()
functions. All exported functions within rdhs
that start dhs_ interact with a different with a different endpoint of the DHS API. Their website gives great information about the different search terms and filters that can be used, and we have tried to include all of this within the documentation of each function.
One of those endpoints, dhs_data()
, interacts with the the published set of standard health indicator data calculated by the DHS. This endpoint allows use to retrieve a set of health indicators that have been sample weighted to give country, subnational estimates that can be further refined by education and wealth brackets. To do this we use the dhs_data()
endpoint, which we can then either search for specific indicators, or by querying for indicators that have been tagged within specific areas.
## what are the indicators
<- dhs_indicators()
indicators 1,] indicators[
## Definition
## 1 Age-specific fertility rate for the three years preceding the survey for age group 10-14 expressed per 1,000 women
## NumberScale IndicatorType MeasurementType IsQuickStat ShortName
## 1 0 I Rate 0 ASFR 10-14
## IndicatorId Level1 IndicatorTotalId Level2 Level3
## 1 FE_FRTR_W_A10 Fertility Fertility rates Women
## SDRID IndicatorOldId TagIds DenominatorWeightedId
## 1 FEFRTRWA10
## Label IndicatorOrder
## 1 Age specific fertility rate: 10-14 11763005
## Denominator
## 1 Per thousand women years exposed in the period 1-36 months prior to interview
## QuickStatOrder IndicatorSpecial1Id DenominatorUnweightedId
## 1
## IndicatorSpecial2Id
## 1
Each call to a DHS endpoint returns a data.frame
by default with all the results available by default.
The DHS has a unique IndicatorId for each of the statistics it calculates. The definition and specific string for each indicator is included within the IndicatorId and Definition variable:
# grab the first 5 alphabetically
order(indicators$IndicatorId),][1:5,c("IndicatorId", "Definition")] indicators[
## IndicatorId
## 2031 AH_CIGA_M_UNW
## 2026 AH_CIGA_W_10P
## 2023 AH_CIGA_W_12C
## 2024 AH_CIGA_W_35C
## 2025 AH_CIGA_W_69C
## Definition
## 2031 Number of men who smoke cigarettes (unweighted)
## 2026 Percentage of women who smoked 10+ cigarettes in preceding 24 hours
## 2023 Percentage of women who smoked 1-2 cigarettes in preceding 24 hours
## 2024 Percentage of women who smoked 3-5 cigarettes in preceding 24 hours
## 2025 Percentage of women who smoked 6-9 cigarettes in preceding 24 hours
Since there are quite a lot of indicators, it might be easier to first query by tags. The DHS tags their indicators by what areas of demography and health they relate to, e.g. anaemia, literacy, malaria parasitaemia are all specific tags. First let’s look at what the tags are, by interacting with the dhs_tags()
endpoint, before grabbing data that related to malaria parasitaemia in the DRC and Tanzania since 2010:
# What are the tags
<- dhs_tags()
tags
# Let's say we want to view the tags that relate to malaria
grepl("Malaria", tags$TagName), ] tags[
## TagType TagName TagID TagOrder
## 31 0 Malaria Parasitemia 36 540
## 43 2 Select Malaria Indicators 79 1000
# and now let's then grab this data by specifying the countryIds and the survey year starts
<- dhs_data(tagIds = 36,countryIds = c("CD","TZ"),breakdown="subnational",surveyYearStart = 2010)
data 1,] data[
## DataId Indicator SurveyId IsPreferred Value
## 1 941297 Malaria prevalence according to RDT CD2013DHS 1 17.1
## SDRID Precision RegionId SurveyYearLabel SurveyType
## 1 MLPMALCRDT 1 CDDHS2013503010 2013-14 DHS
## SurveyYear IndicatorOrder DHS_CountryCode CILow
## 1 2013 125136010 CD
## CountryName IndicatorType CharacteristicId
## 1 Congo Democratic Republic I 503010
## CharacteristicCategory IndicatorId CharacteristicOrder
## 1 Region ML_PMAL_C_RDT 1503010
## CharacteristicLabel ByVariableLabel DenominatorUnweighted
## 1 Kinshasa 406
## DenominatorWeighted CIHigh IsTotal ByVariableId
## 1 532 0 0
Depending on your analysis this maybe more than enough detail. It is also worth mentioning that this data can also be accessed via DHS STATcompiler if you prefer a click and collect version. However, hopefully one can see that selecting a lot of different indicators for multiple countries and breakdowns should be a lot easier using the rdhs
API interaction. For example we can very quickly find out the trends in antimalarial use in Africa, and see if perhaps antimalarial prescription has decreased after RDTs were introduced (assumed 2010).
# Make an api request
<- dhs_data(indicatorIds = "ML_FEVT_C_AML", surveyYearStart = 2010,breakdown = "subnational")
resp
# filter it to 12 countries for space
<- c("Angola","Ghana","Kenya","Liberia",
countries "Madagascar","Mali","Malawi","Nigeria",
"Rwanda","Sierra Leone","Senegal","Tanzania")
# and plot the results
library(ggplot2)
ggplot(resp[resp$CountryName %in% countries,],
aes(x=SurveyYear,y=Value,colour=CountryName)) +
geom_point() +
geom_smooth(method = "glm") +
theme(axis.text.x = element_text(angle = 90, vjust = .5)) +
ylab(resp$Indicator[1]) +
facet_wrap(~CountryName,ncol = 6)
If we incorrectly entered a filter query (very possible), rdhs
will let us know our request was invalid:
# Make an api request
<- dhs_data(indicatorIds="ML_FEVT_C_AMasfafasfL",
resp surveyYearStart=202231231306,
breakdown="subParTyping")
## Error in handle_api_response(resp, TRUE):
## -> DHS API Request Failed [500]
## -> Error Type: dhs_internal_server_error
You may, however, wish to do more nuanced analysis than the API allows. The following 4 section detail a very basic example of how to quickly identify, download and extract datasets you are interested in.
Let’s say we want to get all the survey data from the Democratic Republic of Congo and Tanzania in the last 5 years (since 2013), which covers the use of rapid diagnostic tests (RDTs) for malaria. To begin we’ll interact with the DHS API to identify our datasets.
To start our extraction we’ll query the surveyCharacteristics endpoint using dhs_surveyCharacteristics()
:
## make a call with no arguments
<- dhs_survey_characteristics()
sc grepl("Malaria", sc$SurveyCharacteristicName), ] sc[
## SurveyCharacteristicID SurveyCharacteristicName
## 57 96 Malaria - DBS
## 58 90 Malaria - Microscopy
## 59 89 Malaria - RDT
## 60 57 Malaria module
## 61 8 Malaria/bednet questions
There are 87 different survey characteristics, with one specific survey characteristic for Malaria RDTs. We’ll use this to then find the surveys that include this characteristic. We can also at this point filter for our desired countries and years. The DHS API allows for countries to be filtered using by their countryIds, which is one of the arguments in dhs_surveys()
. To have a look at what each countries countryId is we can use another of the API endpoints first:
## what are the countryIds
<- dhs_countries(returnFields=c("CountryName", "DHS_CountryCode"))
ids str(ids)
## 'data.frame': 91 obs. of 2 variables:
## $ DHS_CountryCode: chr "AF" "AL" "AO" "AM" ...
## $ CountryName : chr "Afghanistan" "Albania" "Angola" "Armenia" ...
# lets find all the surveys that fit our search criteria
<- dhs_surveys(surveyCharacteristicIds = 89,countryIds = c("CD","TZ"),surveyYearStart = 2013)
survs
# and lastly use this to find the datasets we will want to download and let's download the flat files (.dat) datasets (have a look in the dhs_datasets documentation for all argument options, and fileformat abbreviations etc.)
<- dhs_datasets(surveyIds = survs$SurveyId, fileFormat = "flat")
datasets str(datasets)
## 'data.frame': 19 obs. of 13 variables:
## $ FileFormat : chr "Flat ASCII data (.dat)" "Flat ASCII data (.dat)" "Flat ASCII data (.dat)" "Flat ASCII data (.dat)" ...
## $ FileSize : int 198561 7030083 3226262 8028957 11426382 4794941 1569680 6595349 63022 996906 ...
## $ DatasetType : chr "HIV Datasets" "Survey Datasets" "Survey Datasets" "Survey Datasets" ...
## $ SurveyNum : int 421 421 421 421 421 421 421 421 421 421 ...
## $ SurveyId : chr "CD2013DHS" "CD2013DHS" "CD2013DHS" "CD2013DHS" ...
## $ FileType : chr "HIV Test Results Recode" "Births Recode" "Couples' Recode" "Household Recode" ...
## $ FileDateLastModified: chr "November, 14 2014 12:48:34" "November, 17 2014 15:42:54" "November, 17 2014 15:43:04" "September, 19 2016 09:57:20" ...
## $ SurveyYearLabel : chr "2013-14" "2013-14" "2013-14" "2013-14" ...
## $ SurveyType : chr "DHS" "DHS" "DHS" "DHS" ...
## $ SurveyYear : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ DHS_CountryCode : chr "CD" "CD" "CD" "CD" ...
## $ FileName : chr "CDAR61FL.ZIP" "CDBR61FL.ZIP" "CDCR61FL.ZIP" "CDHR61FL.ZIP" ...
## $ CountryName : chr "Congo Democratic Republic" "Congo Democratic Republic" "Congo Democratic Republic" "Congo Democratic Republic" ...
Lastly, we recommended to download either the spss (.sav), fileFormat = "SV"
, or the flat file (.dat), fileFormat = "FL"
datasets. The flat is quicker, but there are still one or two datasets that don’t read correctly, whereas the .sav files are slower to read in but so far no datasets have been found that don’t read in correctly.
We can now use this to download our datasets for further analysis.
We can now go ahead and download our datasets. To do this we need to first create a client
. The client is an R6 class (similar to R’s built in reference classes and make caching survey and API queries more reproducible) and will be used to log in to your DHS account, download datasets for you, and help query those datasets for the question you are interested in. The client will also cache all of these processes, which really helps increase the reproducibility of your analysis.
In order to set up our credentials we use the function set_rdhs_config()
. This will require providing as arguments your email
and project
for which you want to download datasets from. You will then be prompted for your password.
You can also specify a directory for datasets and API calls to be cached to using cache_path
. If you do not provide an argument for cache_path
you will be prompted to provide permission to rdhs
to save your datasets and API calls within your user cache directory for your operating system. This is to comply with CRAN’s requests for permission to be granted before writing to system files. If you do not grant permission, these will be written within your R temporary directory (as we saw above when we first used one of the functions to query the API). Similarly if you do not also provide an argument for config_path
, this will be saved within your temp directory unless permission is granted. Your config files will always be called “rdhs.json”, so that rdhs
can find them easily.
## create a client
## set up your credentials
<- set_rdhs_config(email = "rdhs.tester@gmail.com",
config project = "Testing Malaria Investigations")
<- client_dhs(config)
client client
## <client_dhs>
## Public:
## available_datasets: function (clear_cache_first = FALSE)
## clear_namespace: function (namespace)
## dhs_api_request: function (api_endpoint, query = list(), api_key = private$api_key,
## extract: function (questions, add_geo = FALSE)
## get_cache_date: function ()
## get_config: function ()
## get_datasets: function (dataset_filenames, download_option = "rds", reformat = FALSE,
## get_downloaded_datasets: function ()
## get_root: function ()
## get_variable_labels: function (dataset_filenames = NULL, dataset_paths = NULL, rm_na = FALSE)
## initialize: function (config, api_key = NULL, root = rappdirs_rdhs())
## save_client: function ()
## set_cache_date: function (date)
## survey_questions: function (dataset_filenames, search_terms = NULL, essential_terms = NULL,
## survey_variables: function (dataset_filenames, variables, essential_variables = NULL,
## Private:
## api_endpoints: data indicators countries surveys surveycharacteristics ...
## api_key: ********
## cache_date: 2018-09-24 16:36:45
## check_available_datasets: function (filenames)
## config: rdhs_config
## na_s: ^na -|^na-|.*-na$|.* - na$| \{NA\}$|.* NA$|.*NA$
## package_version: package_version, numeric_version
## root: C:\Users\Oliver\AppData\Local\Oliver\rdhs\Cache
## storr: storr, R6
## url: https://api.dhsprogram.com/rest/dhs/
## user_declared_root: NULL
Before we use our client to download our datasets, it is worth mentioning that the client can be passed as an argument to any of the API functions we have just seen. This will then cache the results for you, so that if you are working remotely or without a good internet connection you can still return your previous API requests. This is what is happening behind the scenes anyway, as rdhs
will create a client for you when you call an API function (if one does not already exist).
Now back to our dataset downloads. If we have a look back at our datasets object, we’ll see there are 19 datasets listed. However, not all of them will be relevant to our malaria RDT questions. One approach is to head to the DHS website and have a look at the DHS Recodes, and look at the recodes that relate to the surveys. The other alternative is to download all the surveys and then query the variables within them. This is what we’ll demonstrate here as it also demonstrates more of the package’s functionality:
So first we will download all these datasets:
# download datasets
<- client$get_datasets(datasets$FileName) downloads
The function returns a list with a file path to where the downloaded datasets have been saved to. By default the files will download quietly, i.e. no progress is shown. However, if you want to see the progress then you can control this by setting this in your config using the verbose_download
argument.
We can now examine what it is we have actually downloaded, by reading in one of these datasets:
# read in our dataset
<- readRDS(downloads$CDPR61FL) cdpr
The dataset returned here contains all the survey questions within the dataset, and what their survey variables are:
# let's look at the variable_names
head(get_variable_labels(cdpr))
## variable description
## 1 hhid Case Identification
## 2 hvidx Line number
## 3 hv000 Country code and phase
## 4 hv001 Cluster number
## 5 hv002 Household number
## 6 hv003 Respondent's line number (answering Household questionnaire)
# and then the dataset
class(cdpr$hv024)
## [1] "labelled"
This is the default behaviour for the client function get_datasets
- it will download the datasets for you, and then by default save them in your client’s root directory and then unzip them and read them in for you, and save the resultant data.frame as a .rds object within the client’s root directory. You can control this behaviour using the download_option
argument as such:
client$get_datasets(download_option = "zip")
- Just the downloaded zip will be savedclient$get_datasets(download_option = "rds")
- Just the read in rds will be savedclient$get_datasets(download_option = "both")
- The zip is downloaded and saved as well as the read in rdsThe other main reason for reading the dataset in straight away as the default option is that the created table of all the survey variables and their definitions is cached then and there, which then allows us to quickly query for particular search terms or survey variables:
# rapid diagnostic test search
<- client$survey_questions(datasets$FileName, search_terms = "malaria rapid test")
questions
table(questions$dataset_filename)
##
## CDHR61FL CDPR61FL TZHR7AFL TZPR7AFL
## 24 1 48 1
What we see from the questions is that the question “Result of malaria rapid test” appears in a few different datasets. This is because the household member recode datasets (CDPR61SV, TZPR7HSV) stores information about the children in a household, with one row per child, whereas the household recode (CDHR61SV, TZHR7HSV) stores information about the household, and thus flattens the information from each child into different subvariables (hml35$01/02 etc). As such it is easier to extract this information from the household member recodes.
To extract our data we pass our questions object to the client function extract
, which will create a list with each dataset and its extracted data as a data.frame
. We also have the option to add any geographic data available, which will download the geographic data files for you and add this data to you resultant extract:
# let's just use the PR files thus
<- dhs_datasets(surveyIds = survs$SurveyId, fileFormat = "FL", fileType = "PR")
datasets <- client$get_datasets(datasets$FileName)
downloads
# and grab the questions from this again along with also questions detailing the province
<- client$survey_questions(datasets$FileName, search_terms = c("malaria rapid test"))
questions
# and now extract the data
<- client$extract(questions, add_geo = FALSE)
extract
# what does our extract look like
str(extract)
## List of 2
## $ CDPR61FL:Classes 'dhs_dataset' and 'data.frame': 95949 obs. of 2 variables:
## ..$ hml35 : 'labelled' int [1:95949] NA NA NA NA NA NA NA 1 0 NA ...
## .. ..- attr(*, "label")= chr "Result of malaria rapid test"
## .. ..- attr(*, "labels")= Named int [1:3] 0 1 9
## .. .. ..- attr(*, "names")= chr [1:3] "negative" "positive" "missing"
## ..$ SurveyId: chr [1:95949] "CD2013DHS" "CD2013DHS" "CD2013DHS" "CD2013DHS" ...
## $ TZPR7AFL:Classes 'dhs_dataset' and 'data.frame': 64880 obs. of 2 variables:
## ..$ hml35 : 'labelled' int [1:64880] NA NA NA NA NA NA NA 0 NA NA ...
## .. ..- attr(*, "label")= chr "Result of malaria rapid test"
## .. ..- attr(*, "labels")= Named int [1:3] 0 1 9
## .. .. ..- attr(*, "names")= chr [1:3] "negative" "positive" "missing"
## ..$ SurveyId: chr [1:64880] "TZ2015DHS" "TZ2015DHS" "TZ2015DHS" "TZ2015DHS" ...
The resultant extract is a list, with a new element for each different dataset that you have extracted. The responses from the dataset are by default stored as a labelled class from the haven package. This class preserves the original semantics and can easily be coerced to factors with haven::as_factor()
. Special missing values are also preserved. For more info on the labelled class have a look at their github.
We can also query our datasets for the survey question variables. In the example above the survey question was Result of malaria rapid test and the variable was hml35. So if you knew the survey variables that you wanted (either by looking at the Recode file or by looking through the variable_names included in the datasets) then we could search against these. So let’s grab the regions using hv024 using the client function survey_variables()
:
# and grab the questions from this now utilising the survey variables
<- client$survey_variables(datasets$FileName, variables = c("hv024","hml35"))
questions
# and now extract the data
<- client$extract(questions, add_geo = FALSE)
extract2
# quick check
head(extract2$CDPR61FL)
## hv024 hml35 SurveyId
## 1 4 NA CD2013DHS
## 2 4 NA CD2013DHS
## 3 4 NA CD2013DHS
## 4 4 NA CD2013DHS
## 5 4 NA CD2013DHS
## 6 4 NA CD2013DHS
head(extract2$TZPR7HFL)
## NULL
# and just to prove that hml35 did actually read in okay (there are just lots of NA)
table(extract2$CDPR61FL$hml35,useNA = "always")
##
## 0 1 9 <NA>
## 5260 2959 8 87722
We can now combine our two dataframes for further analysis using the rdhs
package function rbind_labelled()
. This function works specifically with our lists of labelled data.frames:
# first let's bind our first extraction, without the hv024
<- rbind_labelled(extract)
extract_bound
head(extract_bound)
## hml35 SurveyId DATASET
## CDPR61FL.1 NA CD2013DHS CDPR61FL
## CDPR61FL.2 NA CD2013DHS CDPR61FL
## CDPR61FL.3 NA CD2013DHS CDPR61FL
## CDPR61FL.4 NA CD2013DHS CDPR61FL
## CDPR61FL.5 NA CD2013DHS CDPR61FL
## CDPR61FL.6 NA CD2013DHS CDPR61FL
# now let's try our second extraction
<- rbind_labelled(extract2) extract2_bound
## Warning in rbind_labelled(extract2): Some variables have non-matching value labels: hv024.
## Inheriting labels from first data frame with labels.
This hasn’t quite done what we might want in the second instance. The hv024 variable stores the regions for these 2 countries, which will not be the same and thus the labels will be different between the two of them. Without specifying any additional arguments rbind_labelled()
will simply use the first data.frames labelling as the default, which will mean that some of the Tanzanian provinces will have been encoded as DRC provinces - not good! (This is a similar problem in nature to say trying to add new character strings to a factored data.frame).
There are a few work arounds. Firstly, we can specify a labels argument to the function which will detail how we should handle different variables. labels is a names list that specifies how to handle each variable. If we simply want to keep all the labels then we us the string “concatenate”:
# lets try concatenating the hv024
<- rbind_labelled(extract2, labels = list("hv024"="concatenate"))
better_bound
head(better_bound$hv024)
## <Labelled integer>
## [1] 6 6 6 6 6 6
##
## Labels:
## value label
## 1 arusha
## 2 bandundu
## 3 bas-congo
## 4 dar es salaam
## 5 dodoma
## 6 equateur
## 7 geita
## 8 iringa
## 9 kagera
## 10 kasai-occidental
## 11 kasai-oriental
## 12 kaskazini pemba
## 13 kaskazini unguja
## 14 katanga
## 15 katavi
## 16 kigoma
## 17 kilimanjaro
## 18 kinshasa
## 19 kusini pemba
## 20 kusini unguja
## 21 lindi
## 22 maniema
## 23 manyara
## 24 mara
## 25 mbeya
## 26 mjini magharibi
## 27 morogoro
## 28 mtwara
## 29 mwanza
## 30 njombe
## 31 nord-kivu
## 32 orientale
## 33 pwani
## 34 rukwa
## 35 ruvuma
## 36 shinyanga
## 37 simiyu
## 38 singida
## 39 sud-kivu
## 40 tabora
## 41 tanga
We could also specify new labels for a variable. For example, imagine the two datasets encoded their RDT responses differently, with the first one as c("No","Yes")
and the other as c("Negative","Positive")
. These would be for our purposes the same response, and so we could either leave it and all our results would use the c("No","Yes")
labelling. But we may want to use the latter as it’s more informative/correct, or we may want to be crystal clear and use c("NegativeTest","PositiveTest")
. we can do that like this:
# lets try concatenating the hv024 and providing new labels
<- rbind_labelled(extract2,
better_bound labels = list("hv024"="concatenate",
"hml35"=c("NegativeTest"=0, "PositiveTest"=1)))
# and our new label
head(better_bound$hml35)
## <Labelled integer>
## [1] NA NA NA NA NA NA
##
## Labels:
## value label
## 0 NegativeTest
## 1 PositiveTest
The other option is to not use the labelled class at all. We can control this when we download our datasets, using the argument reformat=TRUE
. This will ensure that no factors or labels are used and it is just the raw data. When this option is set the object returned by client$get_datasets()
no longer has any labelled classes or factors. However, we can still recover the variable table for a dataset using get_variable_labels()
, which will take any dataset output by get_datasets()
and return a data.frame describing the survey question variables and definitions.
# download the datasets with the reformat arguments
<- client$get_datasets(datasets$FileName, reformat=TRUE)
downloads
# grab the questions but specifying the reformat argument
<- client$survey_variables(datasets$FileName, variables = c("hv024", "hml35"),
questions reformat=TRUE)
# and now extract the data
<- client$extract(questions, add_geo = FALSE)
extract3
# group our results
<- rbind_labelled(extract3)
bound_no_labels
# what does our hv024 look like now
class(bound_no_labels$hv024[1])
## [1] "character"
The hv024 column is now just characters, which is possibly the best option depending on your downstream analysis/preferences. It’s for this reason that the geographic data that is added is never turned into factors or labels.
Lastly, we can now use our extract dataset to carry out some regression analysis, to investigate the relationship between malaria prevalence and the quality of wall materials. To do this we will need to first grab the sample weights and stratification from the surveys, along with the extra variables and we will then check the RDT prevalence calculated using the raw data versus the API:
# grab the additional variable hv023 and hv024 which have the strata and weights respectively
<- client$survey_variables(datasets$FileName,variables = c("hv005","hv021","hv022","hv023","hv024",
questions "hv025","hv214","hml20", "hc1","hml35"))
<- client$extract(questions,TRUE)
extraction
# now concatenate the provinces as before and remove missing responses
<- rbind_labelled(extraction,labels=list("hv024"="concatenate","hv214"="concatenate"))
dat <- dat[-which(dat$hml35==9),] # remove missing responses
dat
# and we are going to compare our extract to the API malaria prevalence by RDT, which is for those between 6 and 59 months
<- dat[-which(!dat$hc1>6 & dat$hc1<=60),]
dat
# create a denominator response for hml35
$hml35denom <- as.integer(!is.na(dat$hml35))
dat$bricks <- dat$hv214 %in% c(8,18,5,9,10)
dat$net <- as.logical(dat$hml20)
dat
# specify the strata and sample weights
$strata <- paste0(dat$hv023,dat$DATASET)
dat$hv005 <- dat$hv005/1e6
dat
# construct a survey design using the survey pacakge
library(survey)
# construct the sample design and calculate the mean and totals
<- survey::svydesign(~CLUSTER+DATASET,data=dat,weight=~hv005)
des <- cbind(survey::svyby(~hml35,by=~DHSREGNA+DATASET, des, survey::svyciprop,na.rm=TRUE),
results ::svyby(~hml35denom,by=~DHSREGNA+DATASET, des, survey::svytotal,na.rm=TRUE))
survey<- results[order(results$DATASET),]
results
# grab the same data from the API
<- dhs_data(countryIds = c("CD","TZ"),indicatorIds = "ML_PMAL_C_RDT",breakdown = "subnational",surveyYearStart = 2013, surveyYearEnd = 2016)
dhs_api_data <- cbind(dhs_api_data$Value,dhs_api_data$DenominatorWeighted,dhs_api_data$CharacteristicLabel, dhs_api_data$SurveyId)
dhs_api_data <- dhs_api_data[!grepl("\\.\\.",dhs_api_data[,3]),] # remove subregions included in Tanzania
api <- api[order(apply(api[,4:3],1,paste,collapse="")),]
api
# bind the results and remove duplicate Region Columns
<- cbind(results[,c(1,3,7)],api[])
comparison names(comparison) <- c("Region","Survey_RDT_Prev","Survey_RDT_Denom","API_RDT_Prev","API_RDT_Denom","API_Regions","SurveyID")
head(comparison[,c(1,2,4,3,5,7)])
## Region Survey_RDT_Prev API_RDT_Prev
## Bandundu.CDPR61FL Bandundu 0.2038607 20.2
## Bas-Congo.CDPR61FL Bas-Congo 0.4761599 47.1
## Equateur.CDPR61FL Equateur 0.2732268 27.4
## Kasai-Occidental.CDPR61FL Kasai-Occidental 0.4507341 44.5
## Kasai-Oriental.CDPR61FL Kasai-Oriental 0.4923113 49.4
## Katanga.CDPR61FL Katanga 0.3989890 38.9
## Survey_RDT_Denom API_RDT_Denom SurveyID
## Bandundu.CDPR61FL 1415.6056 1414 CD2013DHS
## Bas-Congo.CDPR61FL 342.6473 347 CD2013DHS
## Equateur.CDPR61FL 1267.4746 1236 CD2013DHS
## Kasai-Occidental.CDPR61FL 604.6050 612 CD2013DHS
## Kasai-Oriental.CDPR61FL 892.6006 894 CD2013DHS
## Katanga.CDPR61FL 861.2030 844 CD2013DHS
It’s a little off, which will be due to the specific stratification the DHS Program will have used, as well as potentially how they have grouped the primary sampling units. We are hoping to get this information from the DHS for each survey so we can make this process more streamline again.
And lastly we will construct a logistic regression to investigate the relationship between a positive malaria RDT and whether the main walls of an individual’s house were made of bricks or similar, while adjusting for urban/rural (v025
) and fixed effects for each survey.
# contsruct our glm using svyglm and specify quasibinomial to handle the na in hml35
summary(svyglm(hml35 ~ DATASET + hv025 + net + bricks, des, family="quasibinomial"))
##
## Call:
## svyglm(formula = hml35 ~ DATASET + hv025 + net + bricks, des,
## family = "quasibinomial")
##
## Survey design:
## survey::svydesign(~CLUSTER + DATASET, data = dat, weight = ~hv005)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.65528 0.23267 -7.114 3.20e-12 ***
## DATASETTZPR7AFL -0.95553 0.12901 -7.406 4.39e-13 ***
## hv025 0.52558 0.13009 4.040 6.03e-05 ***
## netTRUE 0.06529 0.07163 0.911 0.362
## bricksTRUE -0.72109 0.13517 -5.335 1.36e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.9563804)
##
## Number of Fisher Scoring iterations: 4
What we can see is that a significant negative gradient was associated with walls being made of bricks or similarly good materials in comparison to malaria positivity rates by RDT. What is also interesting is that whether the individual slept under a long lasting insecticidal net (hml20 that we converted to net) was not significant.
Hopefully the above tutorial has shown how the rdhs
package can facilitate both querying the DHS API and hopefully make downloading and interacting with the raw datasets a smoother, more reproducible process. It is worth bearing in mind though, that creating a harmonised dataset is not always as easy as the example above - a lot of the time survey variables differ across years and surveys, which is hopefully when the survey_questions
functionality will make it easier to first filter down to those that include the relevant questions before having to decide which survey questions are valid.
Any suggestions or comments/corrections/errors/ideas please let me know either in the issues or send me an email at “o.watson15@imperial.ac.uk”. And if there is any further functionality that you think you would be useful, then also let me know. :)