Customizing dccvalidator

dccvalidator is intended customizable for different settings, however the built-in Shiny application is designed to ensure that data conforms to a specific project organization structure in which data is organized into studies that are described with a combination of metadata files. These metadata files document the individuals, specimens, and assay(s) that were performed in the study. Documenting the data in this way allows researchers to describe the study at different levels while minimizing repetition. These metadata files can later be joined on the individualID and specimenID columns to create a table of all of the metadata for the study.

For projects like AMP-AD, PsychENCODE, and others that follow the same structure, see the next section (“Customizing the Shiny application”) for how configure an instance of the application for the project.

For projects that do not follow the same structure but wish to use some elements of dccvalidator’s data validation capabilities, see “Extending dccvalidator” below.

Customizing the Shiny application

The Shiny app uses a configuration file to set details such as where to store uploaded files, which metadata templates to validate against, whom to contact with questions, etc.

To create a custom version of the app, you’ll need to follow these steps:

Create a Synapse project or folder with the appropriate permissions to store uploaded files. In AMP-AD, we created a folder to which consortium members have permissions to read and write, but not download. Only the curation team has the ability to download files (in order to assist with debugging). See this example of how to create a project with appropriate permissions.
Fork the dccvalidator GitHub repository.
Create a new configuration in the config.yml file. Note that any values you do not customize will be inherited from the default configuration. The configuration file must have a default configuration.
(Optional): create a pull request with your configuration back to the upstream dccvalidator repository.
Within the file app.R, replace the "default" configuration with the name of your new configuration.
Deploy the application as described in the Deploying dccvalidator vignette.

To install the dccvalidator instead of forking the repository:

Create an app.R file containing the following:
```
library("dccvalidator")
run_app()
```
Create a config.yml file using the configuration options specified below and name the parameters “default”.
```
default:
parent: "syn20400157"
```
Create a Synapse project or folder with the appropriate permissions to store uploaded files. In AMP-AD, we created a folder to which consortium members have permissions to read and write, but not download. Only the curation team has the ability to download files (in order to assist with debugging). See this example of how to create a project with appropriate permissions.

Configuration options

parent: The Synapse project or folder where files will be stored
path_to_markdown: Location of an R Markdown document with app instructions. If you wish to omit instructions, insert !expr NA.
study_table: Synapse ID of a table that lists all of the existing studies in the consortium. It should have a column called StudyName.
annotations_table: Synapse ID of a table that lists allowable annotation keys and values for the consortium. This should follow the same basic format as our Synapse Annotations table, e.g. there must be the following columns: key, value, and columnType. columnType options are STRING, BOOLEAN, INTEGER, DOUBLE.
annotations_link: URL to a list or description of allowable annotation values (so users can read more)
templates_link: URL to the location of metadata templates
study_link_animal: URL to an example study description for an animal model study
study_link_human: URL to an example study description for a human study
teams: The team(s) a user must be a member of in order to use the app. If the user is not in any of the teams, they will see a message telling them they must be added to one of the teams.
templates (including manifest_template, individual_templates, biospecimen_templates, and assay_templates): Synapse IDs of templates to use for validation. These should be either .xlsx or .csv files, where the column names reflect the required columns in the template. If the template is an excel file with multiple sheets, the first sheet will be used.
species_list: List of possible species in the consortium. These are shown as options in the validation UI and control which individual template and biospecimen template the app validates against.
complete_columns: For each metadata file and the manifest, a list of the columns that must be complete (i.e. not contain any missing values or empty strings). For metadata files, this should typically include "individualID" and/or "specimenID".
contact_email: Email address linked in footer for users to contact if they have questions

Extending dccvalidator

Users who wish to use dccvalidator for projects that do not follow the same structure as AMP-AD and PsychENCODE can do so by reusing the existing validation functions in their own R code, implementing new checks as needed, and reusing Shiny modules from dccvalidator.

Using validation functions in scripts

Functions to check data for common quality issues are at the core of dccvalidator. These functions in dccvalidator are all named with the pattern check_*(): check_annotation_keys(), check_annotation_values(), check_cols_empty(), etc. See the function reference for a complete list.

These functions are used in the dccvalidator Shiny app, but they can also be used in scripts or reports. Each function takes data as input and returns results in the form of custom condition objects. The condition objects inherit from R’s "message", "warning", and "error" classes. However, rather than raising a message/warning/error, the check functions in dccvalidator return the condition objects themselves.

library("dccvalidator")
library("readr")

## Load a sample manifest
manifest <- read_tsv(
  system.file("extdata", "test_manifest.txt", package = "dccvalidator")
)

manifest
#> # A tibble: 3 x 12
#>   path  parent specimenID individualID assay fakeAnnotation tissue organ
#>   <chr> <chr>  <chr>      <chr>        <chr>          <dbl> <chr>  <chr>
#> 1 /foo… syn20… a1         P01          rnaS…              1 kleen… wron…
#> 2 /foo… syn20… a2         P02          rnaS…              2 puffs  anot…
#> 3 test… syn20… <NA>       <NA>         <NA>              NA <NA>   <NA> 
#> # … with 4 more variables: metadataType <chr>, isMultiSpecimen <lgl>,
#> #   grant <lgl>, consortium <lgl>

## Check that required columns are complete in the manifest
result <- check_cols_complete(
  manifest,
  required_cols = c("path", "parent", "grant")
)

result
#> <error/check_fail>
#> Some required columns are not complete

The condition objects contain several useful pieces of information. There is a message describing the result of the check, as well as a message describing the expected outcome. For warnings and errors, the data that caused the warning or error is included.

result$message
#> [1] "Some required columns are not complete"
result$behavior
#> [1] "Columns path, parent, grant should be complete."
result$data
#> [1] "grant"

Creating new validation functions

While we have functions for many data validation tasks, users may wish to create their own custom checking functions. The functions check_pass(), check_warn(), and check_fail() will create condition objects from the provided arguments. Here is an example of how one could write a custom function that checks if pH values are within an appropriate range.

dat <- data.frame(pH = c(2, 6, 3, -2, 0, 15))

check_values_ph <- function(data, ph_col = "pH",
                            success_msg = "All pH values are valid",
                            fail_msg = "Some pH values are outside the range 0-14",
                            behavior_msg = "pH values should be between 0-14") {
  values <- data[, ph_col, drop = TRUE]
  if (all(values >= 0 & values <= 14)) {
    check_pass(
      msg = success_msg,
      behavior = behavior_msg
    )
  } else {
    check_fail(
      msg = fail_msg,
      behavior = behavior_msg,
      data = values[values > 14 | values < 0]
    )
  }
}

check_values_ph(dat)
#> <error/check_fail>
#> Some pH values are outside the range 0-14

The check_condition() function can make the above a little more concise:

check_values_ph <- function(data, ph_col = "pH",
                            success_msg = "All pH values are valid",
                            fail_msg = "Some pH values are outside the range 0-14",
                            behavior_msg = "pH values should be between 0-14") {
  values <- na.omit(data[, ph_col, drop = TRUE])
  all_valid <- all(values >= 0 & values <= 14)
  
  check_condition(
    msg = ifelse(all_valid, success_msg, fail_msg),
    behavior = behavior_msg,
    data = if (!all_valid) values[values > 14 | values < 0],
    type = ifelse(all_valid, "check_pass", "check_fail")
  )
}

Reusing app modules

The Shiny app that dccvalidator provides allows users to see the results of data validation. This module (results_boxes_server()/results_boxes_ui()) is exported from dccvalidator and can be reused in other Shiny applications like so:

library("shiny")
library("shinydashboard")

server <- function(input, output) {
  # Load sample data
  manifest <- read_tsv(
    system.file("extdata", "test_manifest.txt", package = "dccvalidator")
  )
  biosp <- read_csv(
    system.file("extdata", "test_biospecimen.csv", package = "dccvalidator")
  )
  
  # Add logic to run the checks you are interested in and store the results in a
  # list. Here is an example:
  res <- list(
    check_cols_complete(
      manifest,
      c("path", "parent", "grant"),
      success_msg = "All required columns present are complete in the manifest",
      fail_msg = "Some required columns are incomplete in the manifest"
    ),
    check_cols_complete(
      biosp,
      "specimenID",
      success_msg = "All required columns present are complete in the biospecimen metadata",
      fail_msg = "Some required columns are incomplete in the biospecimen metadata"
    ),
    check_specimen_ids_dup(
      biosp,
      success_msg = "Specimen IDs in the biospecimen metadata file are unique",
      fail_msg = "Duplicate specimen IDs found in the biospecimen metadata file"
    )
  )
  
  # Show results in boxes
  callModule(results_boxes_server, "Validation Results", res)
}

ui <- function(request) {
  dashboardPage(
    header = dashboardHeader(),
    sidebar = dashboardSidebar(),
    body = dashboardBody(
      includeCSS(
        system.file("app/www/custom.css", package = "dccvalidator")
      ),
      results_boxes_ui("Validation Results")
    )
  )
}

shinyApp(ui, server)