Introduction

This vignette provides an overview of the tcpl package. The tables describing the data structure of the ToxCast database, invitrodb, are included in the Appendices.

A. Overview

The tcpl package was developed to process high-throughput and high-content screening data generated by the U.S. Environmental Protection Agency (EPA) ToxCast^TM program¹. ToxCast is screening thousands of chemicals with hundreds of assays coming from numerous and diverse biochemical and cell-based technology platforms. The diverse data, received in heterogeneous formats from numerous vendors, are transformed to a standard computable format and loaded into the tcpl database by vendor-specific R scripts. Once the data are loaded into the database, ToxCast utilizes the generalized processing functions provided in this package to process, normalize, model, qualify, flag, inspect, and visualize the data. While developed primarily for ToxCast, we have attempted to make the tcpl package generally applicable to the chemical-screening community.

The tcpl package includes processing functionality for two screening paradigms: (1) single-concentration screening and (2) multiple-concentration screening. Single-concentration screening consists of testing chemicals at one concentration, often for the purpose of identifying potentially active chemicals to test in the multiple-concentration format. Multiple-concentration screening consists of testing chemicals across a concentration range, such that the modeled activity can give an estimate of potency, efficacy, etc. This version of the package adds functionality for evaluating the uncertainty in curve fitting (Appendix E).

Prior to the pipeline processing provided in this package, all the data must go through pre-processing (level 0). Level 0 pre-processing utilizes dataset-specific R scripts to process the heterogeneous data into a uniform format and to load the uniform data into the tcpl database. Level 0 pre-processing is outside the scope of this package, but can be done for virtually any high-throughput or high-content chemical screening effort, provided the resulting data includes the minimum required information.

In addition to storing the data, the tcpl database stores every processing/analysis decision at the assay component or assay endpoint level to facilitate transparency and reproducibility. For the illustrative purposes of this vignette, we have included a CSV version of the tcpl database containing a small subset of data from the ToxCast program. Using tcplLite, the user can upload a flat file, in ToxCast database format, and process the data using the tcpl analysis protocols. Using tcpl, the user can upload, process and retrieve data by connecting to a MySQL database. The package includes a SQL file to initialize the MySQL database on the user’s server of choice. Additionally, the MySQL version of the ToxCast database containing all the publicly available ToxCast data is available for download at: https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data.

B. Package Settings

First, we highly recommend that users load the data.table package, because the tcpl package uses data.table for all data frame-like objects.

library(data.table)
library(tcpl)
## Store the path for the tcpl directory for loading data
pkg_dir <- system.file(package = "tcpl")

Every time the package is loaded in a new R session, a message similar to the following will print showing the default package settings:

tcpl (v1.3) loaded with the following settings:
  TCPL_DB:    C:/Users/user/R-3.4.4/library/tcpl/csv
  TCPL_USER:  NA
  TCPL_HOST:  NA
  TCPL_DRVR:  tcplLite
Default settings stored in TCPL.conf. See ?tcplConf for more information.

The package has five settings: (1) $TCPL_DB points to the tcpl database (either the path to the CSV directory, as in the example above, or the name of the MySQL database), (2) $TCPL_USER stores the username for accessing the database, (3) $TCPL_PASS stores the password for accessing the database, (4) $TCPL_HOST points to the MySQL server host, and (5) $TCPL_DRVR indicates which database driver to use (either “MySQL” or “tcplLite”).

Refer to ?tcplConf for more information. At any time, users can check the settings using tcplConfList(). An example of database settings using tcpl would be as follows:

tcplConf(drvr = "MySQL", 
         user = "username", 
         pass = "password", 
         host = "localhost",
         db   = "invitrodb")

In addition, the settings can be adjusted to the tcplLite driver:

tcplConf(drvr = "tcplLite", db = system.file("csv", package = "tcpl"), user = "", pass = "", host = "")

The examples illustrated in this vignette use tcplLite for data uploading and processing. Therefore, the user does not need to change the initialized settings. For data retrieval, we will provide examples for accessing data from the tcpl database using a MySQL connection (tcpl) and CSV files (tcplLite).

Note, tcplConf will only make changes to the parameters given. The package is always loaded with the settings stored in the TCPL.conf file located within the package directory. The user can edit the file, such that the package loads with the desired settings, rather than having to call the tcplConf function every time. The TCPL.conf file has to be edited whenever the package is updated or re-installed.

C. Assay Structure

The definition of an “assay” is, for the purposes of this package, broken into:

assay_source – the vendor/origination of the data

assay – the procedure to generate the component data

assay_component – the raw data readout(s)

assay_component_endpoint – the normalized component data

Each assay element is represented by a separate table in the tcpl database. In general, we refer to an “assay_component_endpoint” as an “assay endpoint.” As we move down the hierarchy, each additional layer has a one-to-many relationship with the previous layer. For example, an assay component can have multiple assay endpoints, but an assay endpoint can derive only from a single assay component.

All processing occurs by assay component or assay endpoint, depending on the processing type (single-concentration or multiple-concentration) and level. No data are stored at the assay or assay source level. The “assay” and “assay_source” tables store annotations to help in the processing and down-stream understanding/analysis of the data. For more information about the assay annotations and the ToxCast assays, please refer to https://www.epa.gov/chemical-research/toxicity-forecasting.

Throughout the package, the levels of assay hierarchy are defined and referenced by their primary keys (IDs) in the tcpl database: $\mathit{asid}$ (assay source ID), $\mathit{aid}$ (assay ID), $\mathit{acid}$ (assay component ID), and $\mathit{aeid}$ (assay endpoint ID). In addition, the package abbreviates the fields for the assay hierarchy names. The abbreviations mirror the abbreviations for the IDs with “nm” in place of “id” in the abbreviations, e.g. assay_component_name is abbreviated $\mathit{acnm}$.
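The hierarchy described above can be sketched with four small tables joined on their parent keys. This is only an illustration in base R: the IDs and names are made up, and the abbreviated column names (asnm, anm, acnm, aenm) are used directly as column names for brevity, which is an assumption rather than the layout of the database tables.

```r
# Toy annotation tables (hypothetical IDs and names, not ToxCast data)
assay_source <- data.frame(asid = 1L, asnm = "SRC", stringsAsFactors = FALSE)
assay <- data.frame(aid = 1:2, asid = 1L, anm = c("SRC_A", "SRC_B"),
                    stringsAsFactors = FALSE)
assay_component <- data.frame(
  acid = 1:3, aid = c(1L, 1L, 2L),
  acnm = c("SRC_A_c1", "SRC_A_c2", "SRC_B_c1"),
  stringsAsFactors = FALSE
)
assay_component_endpoint <- data.frame(
  aeid = 1:4, acid = c(1L, 1L, 2L, 3L),
  aenm = c("SRC_A_c1_up", "SRC_A_c1_dn", "SRC_A_c2_up", "SRC_B_c1_up"),
  stringsAsFactors = FALSE
)

# Walk up the hierarchy from endpoints to source by joining on the parent key
h <- merge(assay_component_endpoint, assay_component, by = "acid")
h <- merge(h, assay, by = "aid")
h <- merge(h, assay_source, by = "asid")
h <- h[order(h$aeid), c("asid", "aid", "acid", "aeid", "aenm")]

# One-to-many: a component can have several endpoints,
# but each endpoint (aeid) appears exactly once
endpoints_per_component <- table(assay_component_endpoint$acid)
print(h)
```

Component 1 carries two endpoints in this toy data, while every aeid resolves to a single acid, aid, and asid.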

Appendix A: Field Explanation/Database Structure

This appendix contains reference tables that describe the structure and table fields found in the tcpl database. The first sections of this appendix describe the data-containing tables, followed by a section describing the additional annotation tables.

In general, the single-concentration data and accompanying methods are found in the “sc#” tables, where the number indicates the processing level. Likewise, the multiple-concentration data and accompanying methods are found in the “mc#” tables. Each processing level that has accompanying methods will also have tables with the “_methods” and “_id” naming scheme. For example, the database contains the following tables: “mc5” storing the data from multiple-concentration level 5 processing, “mc5_methods” storing the available level 5 methods, and “mc5_aeid” storing the method assignments for level 5. Note, the table storing the method assignments for level 2 multiple-concentration processing is called “mc2_acid”, because MC2 methods are assigned by assay component ID.

There are two additional tables, “sc2_agg” and “mc4_agg,” that link the data in tables “sc2” and “mc4” to the data in tables “sc1” and “mc3,” respectively. This is necessary because each entry in the database before SC2 and MC4 processing represents a single value; subsequent entries represent summary/modeled values that encompass many values. To know what values were used in calculating the summary/modeled values, the user must use the “_agg” look-up tables.
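As a sketch of how an “_agg” look-up works, the following base R example recovers the individual level 3 values behind one level 4 series. The tables are toy data frames holding only the columns needed for the join, and all values are made up.

```r
# Toy level 3 rows: one normalized response per well (hypothetical values)
mc3 <- data.frame(m3id = 1:6,
                  logc = rep(c(-1, 0, 1), each = 2),
                  resp = c(1.2, -0.8, 10.5, 12.1, 48.9, 51.3))

# Toy level 4 row: one modeled concentration series
mc4 <- data.frame(m4id = 1L, aeid = 10L, spid = "S1",
                  stringsAsFactors = FALSE)

# Look-up table linking each level 3 row to the level 4 series it fed into
mc4_agg <- data.frame(aeid = 10L, m3id = 1:6, m4id = 1L)

# Recover the individual values behind series m4id == 1
used <- merge(mc4_agg[mc4_agg$m4id == 1L, ], mc3, by = "m3id")
used <- used[order(used$logc, used$m3id), c("m4id", "m3id", "logc", "resp")]
print(used)
```

The same pattern applies to “sc2_agg”, with s1id and s2id in place of m3id and m4id.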

Each of the methods tables has fields analogous to $\mathit{mc5\_mthd\_id}$, $\mathit{mc5\_mthd}$, and $\mathit{desc}$. These fields represent the unique key for the method, the abbreviated method name (used to call the method from the corresponding mc5_mthds function), and a brief description of the method, respectively. The “mc6_methods” table may also include the $\mathit{nddr}$ field. More information about $\mathit{nddr}$ is available in the discussion of multiple-concentration level 6 processing.

The method assignment tables will have fields analogous to $\mathit{mc5\_mthd\_id}$ matching the method ID from the methods tables, an assay component or assay endpoint ID, and possibly an $\mathit{exec\_ordr}$ field indicating the order in which to execute the methods.
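To make the relationship between a methods table and its assignment table concrete, here is a minimal base R sketch. The table names mc3_methods and mc3_aeid simply follow the naming scheme described above, the method names are invented, and methods_for is a hypothetical helper that is not part of tcpl.

```r
# Toy methods table following the "<level>_methods" pattern (invented entries)
mc3_methods <- data.frame(
  mc3_mthd_id = 1:3,
  mc3_mthd = c("method_a", "method_b", "method_c"),
  desc = c("first toy method", "second toy method", "third toy method"),
  stringsAsFactors = FALSE
)

# Toy assignment table: endpoint 10 runs method 3 first, then method 1;
# endpoint 11 runs method 2 only
mc3_aeid <- data.frame(aeid = c(10L, 10L, 11L),
                       mc3_mthd_id = c(1L, 3L, 2L),
                       exec_ordr = c(2L, 1L, 1L))

# Hypothetical helper: method names for one endpoint, in execution order
methods_for <- function(id) {
  a <- merge(mc3_aeid[mc3_aeid$aeid == id, ], mc3_methods, by = "mc3_mthd_id")
  a[order(a$exec_ordr), "mc3_mthd"]
}

methods_for(10L)
methods_for(11L)
```

Sorting on exec_ordr after the join is what turns the unordered assignment rows into a processing sequence.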

The method and method assignment tables will not be listed in the tables below to reduce redundancy.

Many of the tables also include the $\mathit{created\_date}$, $\mathit{modified\_date}$, and $\mathit{modified\_by}$ fields that store helpful information for tracking changes to the data. These fields will not be discussed further or included in the tables below.

Many of the tables specific to the assay annotation are not utilized by the tcpl package. The full complexity of the assay annotation used by the ToxCast program is beyond the scope of this vignette and the tcpl package. More information about the ToxCast assay annotation can be found at: https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data.

A. Single-concentration data-containing tables

Table 5: Fields in sc0 table.
Field	Description
s0id	Level 0 ID
acid	Assay component ID
spid	Sample ID
cpid	Chemical plate ID
apid	Assay plate ID
rowi	Assay plate row index
coli	Assay plate column index
wllt	Well type†
wllq	1 if the well quality was good, else 0‡
conc	Concentration in micromolar
rval	Raw assay component value/readout from vendor
srcf	Filename of the source file containing the data
†Information about the different well types is available in Appendix B.

Table 6: Fields in sc1 table.
Field	Description
s1id	Level 1 ID
s0id	Level 0 ID
acid	Assay component ID
aeid	Assay component endpoint ID
logc	Log base 10 concentration
bval	Baseline value
pval	Positive control value
resp	Normalized response value

Table 7: Fields in sc2_agg table.
Field	Description
aeid	Assay component endpoint ID
s0id	Level 0 ID
s1id	Level 1 ID
s2id	Level 2 ID

Table 8: Fields in sc2 table.
Field	Description
s2id	Level 2 ID
aeid	Assay component endpoint ID
spid	Sample ID
bmad	Baseline median absolute deviation
max_med	Maximum median response value
hitc	Hit-/activity-call, 1 if active, 0 if inactive
coff	Efficacy cutoff value
tmpi	Ignore, temporary index used for uploading purposes

B. Multiple-concentration data-containing tables

The “mc0” table, other than containing $\mathit{m0id}$ rather than $\mathit{s0id}$, is identical to the “sc0” described in the section above.

Table 9: Fields in mc1 table.
Field	Description
m1id	Level 1 ID
m0id	Level 0 ID
acid	Assay component ID
cndx	Concentration index
repi	Replicate index

Table 10: Fields in mc2 table.
Field	Description
m2id	Level 2 ID
m0id	Level 0 ID
acid	Assay component ID
m1id	Level 1 ID
cval	Corrected value

Table 11: Fields in mc3 table.
Field	Description
m3id	Level 3 ID
aeid	Assay endpoint ID
m0id	Level 0 ID
acid	Assay component ID
m1id	Level 1 ID
m2id	Level 2 ID
bval	Baseline value
pval	Positive control value
logc	Log base 10 concentration
resp	Normalized response value

Table 12: Fields in mc4_agg table.
Field	Description
aeid	Assay endpoint ID
m0id	Level 0 ID
m1id	Level 1 ID
m2id	Level 2 ID
m3id	Level 3 ID
m4id	Level 4 ID

Table 13: Fields in mc4 table (Part 1).
Field	Description
m4id	Level 4 ID
aeid	Assay endpoint ID
spid	Sample ID
bmad	Baseline median absolute deviation
resp_max	Maximum response value
resp_min	Minimum response value
max_mean	Maximum mean response value
max_mean_conc	Log concentration at max_mean
max_med	Maximum median response value
max_med_conc	Log concentration at max_med
logc_max	Maximum log concentration tested
logc_min	Minimum log concentration tested
cnst	1 if the constant model converged, 0 if it failed to converge, N/A if series had less than four concentrations
hill	1 if the Hill model converged, 0 if it failed to converge, N/A if series had less than four concentrations or if max_med < 3bmad
hcov	1 if the Hill model Hessian matrix could be inverted, else 0
gnls	1 if the gain-loss model converged, 0 if it failed to converge, N/A if series had less than four concentrations or if max_med < 3bmad
gcov	1 if the gain-loss model Hessian matrix could be inverted, else 0
cnst_er	Scale term for the constant model
cnst_aic	AIC for the constant model
cnst_rmse	RMSE for the constant model
cnst_prob	Probability the constant model is the true model
hill_tp	Top asymptote for the Hill model
hill_tp_sd	Standard deviation for hill_tp
hill_ga	AC₅₀ for the Hill model
hill_ga_sd	Standard deviation for hill_ga

Table 14: Fields in mc4 table (Part 2).
Field	Description
hill_gw	Hill coefficient
hill_gw_sd	Standard deviation for hill_gw
hill_er	Scale term for the Hill model
hill_er_sd	Standard deviation for hill_er
hill_aic	AIC for the Hill model
hill_rmse	RMSE for the Hill model
hill_prob	Probability the Hill model is the true model
gnls_tp	Top asymptote for the gain-loss model
gnls_tp_sd	Standard deviation for gnls_tp
gnls_ga	AC₅₀ in the gain direction for the gain-loss model
gnls_ga_sd	Standard deviation for gnls_ga
gnls_gw	Hill coefficient in the gain direction
gnls_gw_sd	Standard deviation for gnls_gw
gnls_la	AC₅₀ in the loss direction for the gain-loss model
gnls_la_sd	Standard deviation for gnls_la
gnls_lw	Hill coefficient in the loss direction
gnls_lw_sd	Standard deviation for gnls_lw
gnls_er	Scale term for the gain-loss model
gnls_er_sd	Standard deviation for gnls_er
gnls_aic	AIC for the gain-loss model
gnls_rmse	RMSE for the gain-loss model
gnls_prob	Probability the gain-loss model is the true model
nconc	Number of concentrations tested
npts	Number of points in the concentration series
nrep	Number of replicates in the concentration series
nmed_gtbl	Number of median values greater than 3bmad
tmpi	Ignore, temporary index used for uploading purposes

Table 15: Fields in mc5 table.
Field	Description
m5id	Level 5 ID
m4id	Level 4 ID
aeid	Assay endpoint ID
modl	Winning model: “cnst”, “hill”, or “gnls”
hitc	Hit-/activity-call, 1 if active, 0 if inactive, -1 if cannot determine
fitc	Fit category
coff	Efficacy cutoff value
actp	Activity probability (1 - cnst_prob)
modl_er	Scale term for the winning model
modl_tp	Top asymptote for the winning model
modl_ga	Gain AC₅₀ for the winning model
modl_gw	Gain Hill coefficient for the winning model
modl_la	Loss AC₅₀ for the winning model
modl_lw	Loss Hill coefficient for the winning model
modl_prob	Probability for the winning model
modl_rmse	RMSE for the winning model
modl_acc	Activity concentration at cutoff for the winning model
modl_acb	Activity concentration at baseline for the winning model
modl_ac10	AC₁₀ for the winning model

Table 16: Fields in mc6 table.
Field	Description
m6id	Level 6 ID
m5id	Level 5 ID
m4id	Level 4 ID
aeid	Assay endpoint ID
m6_mthd_id	Level 6 method ID
flag	Text output for the level 6 method
fval	Value from the flag method, if applicable
fval_unit	Units for fval, if applicable

Table 17: Fields in mc7 table.
Field	Description
m4id	Level 4 ID
aeid	Assay endpoint ID
aenm	Assay endpoint name
asid	Assay source ID
acid	Assay component ID
hit_pct	Total percent of hit calls made after 1000 bootstraps
total_hitc	Total number of hit calls made after 1000 bootstraps
modl_ga_min	Lower bound of the 95% confidence interval for the AC₅₀
modl_ga_max	Upper bound of the 95% confidence interval for the AC₅₀
modl_ga_med	Median AC₅₀ after 1000 bootstraps
modl_gw_med	Median gain Hill coefficient for 1000 bootstraps
modl_ga_delta	AC₅₀ confidence interval width in log units
cnst_pct	Percent of 1000 bootstraps that the constant model was selected as the winning model
hill_pct	Percent of 1000 bootstraps that the Hill model was selected as the winning model
gnls_pct	Percent of 1000 bootstraps that the gain-loss model was selected as the winning model

C. Auxiliary annotation tables

As mentioned in the introduction to this appendix, a full description of the assay annotation is beyond the scope of this vignette. The fields pertinent to the tcpl package are listed in the tables below.

Table 18: List of annotation tables.
Field	Description
assay	Assay-level annotation
assay_component	Assay component-level annotation
assay_component_endpoint	Assay endpoint-level annotation
assay_component_map	Assay component source names and their corresponding assay component ids
assay_reagent**	Assay reagent information
assay_reference**	Map of citations to assay
assay_source	Assay source-level annotation
chemical	List of chemicals and associated identifiers
chemical_library	Map of chemicals to different chemical libraries
citations**	List of citations
gene**	Gene identifiers and descriptions
intended_target**	Intended assay target at the assay endpoint level
mc5_fit_categories	The level 5 fit categories
organism**	Organism identifiers and descriptions
sample	Sample ID information and chemical ID mapping
technological_target**	Technological assay target at the assay component level
** indicates tables not currently used by the tcpl package

Table 19: Fields in assay.
Field	Description
aid	Assay ID
asid	Assay source ID
assay_name	Assay name (abbreviated “anm” within the package)
assay_desc	Assay description
timepoint_hr	Treatment duration in hours
assay_footprint	Microtiter plate size†
† discussed further in the “Register and Upload New Data” section

Table 20: Fields in assay_component.
Field	Description
acid	Assay component ID
aid	Assay ID
assay_component_name	Assay component name (abbreviated “acnm” within the package)
assay_component_desc	Assay component description

Table 21: Fields in assay_source.
Field	Description
asid	Assay source ID
assay_source_name	Assay source name (typically an abbreviation of the assay_source_long_name, abbreviated “asnm” within the package)
assay_source_long_name	The full assay source name
assay_source_description	Assay source description

Table 22: Fields in assay_component_endpoint.
Field	Description
aeid	Assay component endpoint ID
acid	Assay component ID
assay_component_endpoint_name	Assay component endpoint name (abbreviated “aenm” within the package)
assay_component_endpoint_desc	Assay component endpoint description
export_ready	0 or 1, used to flag data as “done”
normalized_data_type	The units of the normalized data†
burst_assay	0 or 1, 1 indicates the assay results should be used in calculating the burst z-score
fit_all	0 or 1, 1 indicates all results should be fit, regardless of whether the max_med surpasses 3bmad
† discussed further in the “Register and Upload New Data” section

Table 23: Fields in assay_component_map table.
Field	Description
acid	Assay component ID
acsn	Assay component source name

Table 24: Fields in chemical table.
Field	Description
chid	Chemical ID†
casn	CAS Registry Number
chnm	Chemical name
† this is the DSSTox GSID within the ToxCast data, but can be any integer and will be auto-generated (if not explicitly defined) for newly registered chemicals

Table 25: Fields in chemical_library table.
Field	Description
chid	Chemical ID
clib	Chemical library

Table 26: Fields in mc5_fit_categories.
Field	Description
fitc	Fit category
parent_fitc	Parent fit category
name	Fit category name
xloc	x-axis location for plotting purposes
yloc	y-axis location for plotting purposes

Table 27: Fields in sample table.
Field	Description
spid	Sample ID
chid	Chemical ID
stkc	Stock concentration
stkc_unit	Stock concentration unit
tested_conc_unit	The concentration unit for the concentration values in the data-containing tables
spid_legacy	A place-holder for previous sample ID strings

The stock concentration fields in the “sample” table allow the user to track the original concentration when the neat sample is solubilized in vehicle before any serial dilutions for testing purposes.

Appendix B: Level 0 Pre-processing

Level 0 pre-processing can be done on virtually any high-throughput/high-content screening application. In the ToxCast program, level 0 processing is done in R by vendor/dataset-specific scripts. The individual R scripts act as the “laboratory notebook” for the data, with all pre-processing decisions clearly commented and explained.

Level 0 pre-processing has to reformat the raw data into the standard format for the pipeline, and also can make manual changes to the data. All manual changes to the data should be very well documented with justification. Common examples of manual changes include fixing a sample ID typo, or changing well quality value(s) to 0 after finding obvious problems like a plate row/column missing an assay reagent.

Each row in the level 0 pre-processing data represents one well-assay component combination, containing 11 fields (Table 28). The only field in level 0 pre-processing not stored at level 0 is the assay component source name ($\mathit{acsn}$). The assay component source name should be some concatenation of data from the assay source file that identifies the unique assay components. When the data are loaded into the database, the assay component source name is mapped to assay component ID through the assay_component_map table in the tcpl database. Assay components can have multiple assay component source names, but each assay component source name can only map to a single assay component.

Table 28: Required fields in level 0 pre-processing.
Field	Description	N/A
acsn	Assay component source name	No
spid	Sample ID	No
cpid	Chemical plate ID	Yes
apid	Assay plate ID	Yes
rowi	Assay plate row index, as an integer	Yes
coli	Assay plate column index, as an integer	Yes
wllt	Well type	No
wllq	1 if the well quality was good, else 0	No
conc	Concentration in micromolar	No†
rval	Raw assay component value/readout from vendor	Yes‡
srcf	Filename of the source file containing the data	No
The N/A column indicates whether the field can be N/A in the pre-processed data. †Concentration can be N/A for control values only tested at a single concentration. Concentration cannot be N/A for any test compound (well type of “t”) data. ‡If the raw value is N/A, well quality has to be 0.
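The field rules in Table 28 can be checked before loading with a few lines of base R. The sketch below is not part of tcpl: check_lvl0 is a hypothetical helper, the data are invented, and the rule on missing raw values is read as referring to well quality (wllq), since well type holds letter codes rather than 0/1.

```r
# Toy level 0 pre-processing data with the 11 fields from Table 28
# (all values are made up for illustration)
lvl0 <- data.frame(
  acsn = "toy_component",
  spid = c("S1", "S1", "DMSO", "CTRL"),
  cpid = NA_character_,
  apid = "plate1",
  rowi = c(1L, 1L, 2L, 2L),
  coli = 1:4,
  wllt = c("t", "t", "n", "p"),
  wllq = c(1L, 0L, 1L, 1L),
  conc = c(10, 30, NA, NA),
  rval = c(105.2, NA, 3.1, 250.4),
  srcf = "toy_file.csv",
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns a character vector of rule violations
check_lvl0 <- function(dat) {
  required <- c("acsn", "spid", "cpid", "apid", "rowi", "coli",
                "wllt", "wllq", "conc", "rval", "srcf")
  missing_flds <- setdiff(required, names(dat))
  if (length(missing_flds) > 0) {
    return(paste("missing fields:", paste(missing_flds, collapse = ", ")))
  }
  problems <- character(0)
  # Fields that may never be N/A
  for (fld in c("acsn", "spid", "wllt", "wllq", "srcf")) {
    if (anyNA(dat[[fld]])) problems <- c(problems, paste(fld, "contains N/A"))
  }
  # Test compounds need a concentration
  if (any(dat$wllt == "t" & is.na(dat$conc), na.rm = TRUE)) {
    problems <- c(problems, "test well without concentration")
  }
  # A missing raw value is only allowed when the well quality is 0
  if (any(is.na(dat$rval) & dat$wllq != 0, na.rm = TRUE)) {
    problems <- c(problems, "N/A rval with wllq not 0")
  }
  problems
}

check_lvl0(lvl0)
```

An empty result means none of the rules were violated. Running such a check in the dataset-specific script keeps the pre-processing decisions visible alongside the data.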

The well type field is used in the processing to differentiate controls from test compounds in numerous applications, including normalization and definition of the assay noise level. Currently, the tcpl package includes the eight well types in Table 29. Package users are encouraged to suggest new well types and methods to better accommodate their data.

Table 29: Well types
Well type	Description
t	Test compound
c	Gain-of-signal control in multiple concentrations
p	Gain-of-signal control in single concentration
n	Neutral/negative control
m	Loss-of-signal control in multiple concentrations
o	Loss-of-signal control in single concentration
b	Blank well
v	Viability control

The final step in level 0 pre-processing is loading the data into the tcpl database. The tcpl package includes the tcplWriteLvl0 function to load data into the database. The tcplWriteLvl0 function maps the assay component source name to the appropriate assay component ID, checks each field for the correct class, and checks the database for the sample IDs with well type “t.” Each test compound sample ID must be included in the tcpl database before loading data. The tcplWriteLvl0 function also checks each test compound for concentration values.
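The mapping from assay component source name to assay component ID can be pictured as a left join against the assay_component_map table. The following base R sketch uses invented names and values and does not reproduce the internals of tcplWriteLvl0; it only illustrates the many-to-one mapping and how unmapped source names could be detected.

```r
# Toy map: two source names point at the same assay component (invented)
assay_component_map <- data.frame(
  acid = c(1L, 1L, 2L),
  acsn = c("toy_readout_v1", "toy_readout_v2", "toy_other"),
  stringsAsFactors = FALSE
)

lvl0 <- data.frame(
  acsn = c("toy_readout_v1", "toy_readout_v2", "toy_other", "toy_unknown"),
  rval = c(1.5, 2.5, 3.5, 4.5),
  stringsAsFactors = FALSE
)

# Each source name may map to only one assay component
stopifnot(!anyDuplicated(assay_component_map$acsn))

# A left join keeps unmapped rows so they can be reported
mapped <- merge(lvl0, assay_component_map, by = "acsn", all.x = TRUE)
unmapped <- unique(mapped$acsn[is.na(mapped$acid)])
unmapped
```

Here “toy_unknown” has no entry in the map, so it would need to be registered before the data could be loaded.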

Appendix C: Cytotoxicity Distribution

Recognizing the substantial impact of cytotoxicity in confounding high-throughput and high-content screening results, the tcpl package includes methodology for defining chemical-specific cytotoxicity estimates. Our observations based on ToxCast data suggest a complex and not yet fully understood cellular biology that includes non-specific activation of many targets as cells approach death. For example, a chemical may induce activity in an estrogen-related assay, but if that chemical also causes activity in hundreds of other assays at or around the same concentration as cytotoxicity, should the chemical be called an estrogen agonist? The tcplCytoPt function provides an estimate of chemical-specific cytotoxicity points to provide some context to the “burst” phenomenon.

The cytotoxicity point is simply the median AC$_{50}$ for a set of assay endpoints, either given by the user or defined within the tcpl database. By default, the tcplCytoPt function uses the assay endpoints listed in the $\mathit{burst\_assay}$ field of the “assay_component_endpoint” table, where 1 indicates including the assay endpoint in the calculation. The “burst” assay endpoints can be identified by running tcplLoadAeid(fld = "burst_assay", val = 1).

In addition to the cytotoxicity point, tcplCytoPt provides two additional estimates: (1) the MAD of the AC$_{50}$ ($\mathit{modl\_ga}$) values used to calculate the cytotoxicity point, and (2) the global MAD. Note, only active assay endpoints (where the hit call, $\mathit{hitc}$, equals $1$) are included in the calculations. Once the burst distribution (cytotoxicity point and MAD) is defined for each chemical, the global burst MAD is defined as the median of the MAD values. Not every chemical may be tested in every “burst” assay, so the user can determine the minimum number of tested assays as a condition for the MAD value for a particular chemical to be included in the global MAD calculation. By default, if “aeid” is the vector of assay endpoints used in the calculation, tcplCytoPt requires the chemical to be tested in at least floor(0.8 * length(aeid)) assay endpoints to be included in the calculation. The user can specify to include all calculated MAD values (note, there must be at least two active assay endpoints to calculate the MAD) by setting ‘min.test’ to FALSE. The ‘min.test’ parameter also accepts a number, allowing the user to explicitly set the requirement.

The global MAD gives an estimate of the overall cytotoxicity window and allows a cytotoxicity distribution to be determined for chemicals with fewer than two active “burst” assay endpoints. The cytotoxicity point for chemicals with fewer than two active “burst” endpoints is set to the value given to the ‘default.pt’ parameter. By default, tcplCytoPt assigns ‘default.pt’ to 3.²
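The calculation described in this appendix can be walked through on toy data in base R. This is a sketch of the arithmetic only, not the tcplCytoPt implementation: the data are invented, the number of tested endpoints is taken to be the number of rows per chemical, and R's stats::mad is used with its default scaling constant, which may or may not match what tcplCytoPt applies.

```r
# Toy level 5 results for five "burst" endpoints, one row per
# chemical-endpoint pair. modl_ga is the log10 AC50 in micromolar.
dat <- data.frame(
  chid = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  aeid = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2, 3, 4, 5),
  hitc = c(1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0),
  modl_ga = c(1.0, 1.2, 1.4, 1.6, NA, 0.5, 0.7, 1.5, NA, 2.0, NA, NA, NA, NA)
)

burst_aeid <- 1:5
min_test <- floor(0.8 * length(burst_aeid))  # default rule from the text
default_pt <- 3                              # log10(1000 uM), i.e. 1 mM

# Per-chemical burst distribution from the active endpoints only
per_chem <- do.call(rbind, lapply(split(dat, dat$chid), function(d) {
  act <- d$modl_ga[d$hitc == 1]
  data.frame(chid = d$chid[1],
             ntst = nrow(d),
             nhit = length(act),
             med = if (length(act) >= 2) median(act) else NA_real_,
             mad = if (length(act) >= 2) mad(act) else NA_real_)
}))

# Global MAD: median of the per-chemical MADs that meet the testing requirement
use <- per_chem$ntst >= min_test & !is.na(per_chem$mad)
global_mad <- median(per_chem$mad[use])

# Chemicals with fewer than two active endpoints get the default point
per_chem$cyto_pt <- ifelse(per_chem$nhit >= 2, per_chem$med, default_pt)
print(per_chem)
```

With five burst endpoints the default requirement is four tested endpoints, so chemicals 1 and 2 contribute to the global MAD. Chemical 3 has only one active endpoint and therefore receives the default cytotoxicity point of 3.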

Appendix D: Build Variable Matrices

The tcplVarMat function creates chemical-by-assay matrices for the level 4 and level 5 data. When multiple sample-assay series exist for one chemical, a single series is selected by the tcplSubsetChid function. See ?tcplSubsetChid for more information. The standard matrices are:

“modl_ga” – The $\log_{10}\mathit{AC_{50}}$ (in the gain direction) for the winning model.
“hitc” – The hit call for the winning model.
“m4id” – The m4id, listing the concentration series selected by tcplSubsetChid.
“zscore” – The z-score (described below).
“tested_sc” – $1$ or $0$, $1$ indicating the chemical/assay pair was tested in the single-concentration format.
“tested_mc” – $1$ or $0$, $1$ indicating the chemical/assay pair was tested in the multiple-concentration format.
“ac50” – a modified AC$_{50}$ table (in non-log units) where assay/chemical pairs that were not tested, or tested and had a hit call of $0$ or $-1$ have the value $1e6$.
“neglogac50” – $-\log_{10}\frac{\mathit{AC_{50}}}{1e6}$ where assay/chemical pairs that were not tested, or tested and had a hit call of $0$ or $-1$ have the value $0$.
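The relationship between “modl_ga”, “ac50”, and “neglogac50” is easiest to see on a few values. The base R sketch below uses invented numbers and plain named vectors rather than the matrices returned by tcplVarMat; an N/A hit call stands in for a pair that was not tested.

```r
# Toy inputs: log10 AC50 (micromolar) and hit calls; NA hit call = not tested
modl_ga <- c(chemA = 0, chemB = 2, chemC = 1.5, chemD = NA)
hitc    <- c(chemA = 1, chemB = 1, chemC = 0,   chemD = NA)

active <- !is.na(hitc) & hitc == 1

# "ac50": AC50 in micromolar, with 1e6 for untested or inactive pairs
ac50 <- ifelse(active, 10^modl_ga, 1e6)

# "neglogac50": -log10(AC50 / 1e6), which is 0 for the placeholder 1e6
neglogac50 <- -log10(ac50 / 1e6)

rbind(ac50, neglogac50)
```

Dividing by 1e6 converts micromolar to molar, so an AC$_{50}$ of 1 micromolar gives a value of 6, and the 1e6 placeholder maps to 0. Larger values therefore indicate greater potency.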

The z-score calculation is based on the output from tcplCytoPt (Appendix C), and is calculated for each AC$_{50}$ value as follows:

\[ \mathit{z\text{-}score} = -\frac{\mathit{modl\_ga} - \mathit{cyto\_pt}}{\mathit{global\_mad}} \]

Note: the burst z-score values are multiplied by -1 so that values that are more potent relative to the burst distribution receive a higher positive z-score.
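A short numeric example may help with the sign convention. The values below are invented and are all on the log10 micromolar scale; the variable names mirror the terms in the formula.

```r
# Toy values on the log10 micromolar scale (made up for illustration)
modl_ga    <- c(0.5, 1.5, 2.5)  # log10 AC50 in three assay endpoints
cyto_pt    <- 1.5               # cytotoxicity point for the chemical
global_mad <- 0.5               # global burst MAD

zscore <- -(modl_ga - cyto_pt) / global_mad
zscore
```

The endpoint with an AC$_{50}$ one log unit below the cytotoxicity point gets a z-score of 2, the endpoint at the cytotoxicity point gets 0, and the endpoint one log unit above it gets -2.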

In addition to the standard matrices, additional matrices can be defined by the ‘add.vars’ parameter in the tcplVarMat function. The ‘add.vars’ parameter will take any level 4 or level 5 field and create the respective matrix.

Appendix E: Curve Fitting Uncertainty

Level 7 in the tcpl package provides a quantitative metric for assessing the uncertainty in identifying active chemicals in high-throughput chemical screening assays. Consequently, the pipeline will be able to better identify false positive and false negative hit calls. This level uses the bootstrapping method used in the toxboot R package. Briefly, toxboot uses smooth nonparametric bootstrap resampling, which adds random normally distributed noise, to give a resampled set of concentration-response values. The resampled data are fit to the three ToxCast models, and the procedure is repeated 1000 times. The resulting data are used to generate point estimates, the winning model, and a hit call for each of the 1000 resamples. Various summary statistics, such as hit percent, median AC$_{50}$, and AC$_{50}$ confidence interval, are generated based on the toxboot resampling. For more information on the toxboot package, the user can access the documentation available on the CRAN repository.³

The tcpl package provides a platform for identifying active chemicals from more than 1000 chemical-assay endpoint pairs. The curve fits for these pairs are prone to variability in the calculated parameters due to biological variance and model fitting, to name two sources. The uncertainty in the bioactivity values can propagate into errors in the toxicological endpoints estimated for risk assessment. The quantitative strategy used in level 7 to assess the uncertainty in the fitted bioactivity values, such as AC$_{50}$, can increase the predictive power of in vitro toxicity prediction. The addition of level 7 allows the user to calculate the hit percent, a probabilistic metric ranging from 0 to 1 for determining the hit call for a chemical-assay endpoint pair. Compared with the binary hit call (0 or 1), the hit percent is more sensitive to noise and artefacts in the fitted data. Level 7 data can be retrieved from the MySQL database using the tcplLoadData function. The user will need the m4id(s) corresponding to the sample(s) of interest.
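The idea behind the hit percent can be illustrated with a heavily simplified resampling sketch in base R. It does not use toxboot or tcpl, and it makes two assumptions that differ from the real procedure: the three-model fitting is replaced by a simple rule (the maximum median response must reach the cutoff), and the noise standard deviation is supplied by hand. The helper hit_percent and all data are hypothetical.

```r
set.seed(1)

# Hypothetical helper: fraction of resamples whose maximum median response
# reaches the cutoff. resp is a list with one vector of replicates per
# concentration; noise_sd is the standard deviation of the added noise.
hit_percent <- function(resp, cutoff, noise_sd, n_boot = 1000) {
  hits <- replicate(n_boot, {
    med <- vapply(resp, function(r) {
      # Resample replicates with replacement, then smooth with normal noise
      r_star <- sample(r, length(r), replace = TRUE) +
        rnorm(length(r), mean = 0, sd = noise_sd)
      median(r_star)
    }, numeric(1))
    max(med) >= cutoff
  })
  mean(hits)
}

# Toy series: three replicates at each of four concentrations
active_series   <- list(c(1, -2, 0), c(4, 6, 5), c(38, 42, 40), c(79, 83, 81))
inactive_series <- list(c(1, -2, 0), c(2, -1, 0), c(-1, 1, 0), c(0, 2, -2))

hit_percent(active_series, cutoff = 20, noise_sd = 3)
hit_percent(inactive_series, cutoff = 20, noise_sd = 3)
```

A clearly active series is called a hit in essentially every resample and a flat series in essentially none. Series whose responses sit near the cutoff would land in between, which is the information a single binary hit call cannot convey.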

¹ https://www.epa.gov/chemical-research/toxicity-forecasting
² $10^3 = 1000$; therefore, when using micromolar units, $3$ is equivalent to $1$ millimolar. $1$ millimolar was chosen as an arbitrary high concentration (outside the testing range for ToxCast data), based on the principle that all compounds are toxic if given in high enough concentration.
³ https://CRAN.R-project.org/package=toxboot

The ToxCast(TM) Analysis Pipeline (tcpl)
An R Package for Processing and Modeling Chemical Screening Data (Version 2.0)

National Center for Computational Toxicology, US EPA