replacer
, a value replacement utility currently based on package data.table
, is intended for outside-database update of data. It requires the preparation of a lookup file which is a list of replacement requests with and, in special circumstances, without an index column.
The input and output data files are comma-separated values file format (csv). Various other file formats can therefore be processed upon conversion.
Data processing such as cleanup, file format conversion and the appendage of new data are outside the scope of this utility.
There could be several lookup files associated with one data file and vice versa which will be batch-processed.
This vignette defines the terms related to replacement and to the elements of the lookup file, and details the procedure of creating an efficient lookup file suggesting ways to maintain it relevant for future updates. It continues by introducing the two User-intended functions, for single file and batch file processing and describes their internal workflow. Finally, it presents a screen output result obtained with a lookup without index and introduces the datasets that are used for examples.
This utility is accessible to beginners to R
working with outside-database files. It facilitates complex dataset updates with minimal prompt time by employing User-friendly functions which automatically follow a decision tree rooted in User’s input.
The choice for data.table
is motivated by its concise and sound code, effectiveness and efficiency in processing large and complex datasets, and its independence on packages, or verses, other than base R
itself; this utility will only install data.table
on User’s machine had it had not been installed already.
Hence, familiarity with basic R
commands is necessary. Familiarity with data.table
is not, yet strongly recommended!
The following terms are specific to this utility:
lookup table: a csv format data file containing at least 4 standard columns with standard column names:
simple (or 1:1) replacement: one value from newVals replaces one value from oldVals
multiple (or 1:many) replacement: one value from newVals replaces identical values from oldVals
split: in this context, recursive partition of multiple replacements into simple and multiple replacements
generic value: restricted replacement value used for multiple replacements of missing data
involved column: data column with the name listed in vars column of lookup
request: implicit information containing value type, value missingness, value uniqueness, value location, replacement type, replacement value - in one row, spanning the standard columns of lookup
Lookup may contain a combination of simple and multiple replacements, simple replacements only, multiple replacements only (i.e. missing only, duplicated values only, or mixed) as needed. Multiple replacements can be split as needed.
On special circumstances, column “id” may be absent from lookup while the presence of the other three standard columns is mandatory. Modifying standard columns’ names will result in error. Any extra columns in lookup will be automatically removed during internal conformance checks on lookup.
Multiple replacements apply to missing or duplicated non-missing values. While replacement values of duplicated values could be regarded as generic, they are entirely decided by the User. They are not restricted in this sense.
In this version, replacements of missing data are restricted to words formed with the word roots MIS|mis, PRE|pre, UN|un, ABS|abs and the adverbs YES|yes, NO|no for missing values of type character and with a number equal to- or larger than 3 integer 9s (i.e. 999, 9999 etc.) for missing values of double or integer types.
Short sentences beginning with any word root or adverb above may be accepted.
Typical data replacement workflow: the initial data file and the corresponding lookup file (here named “data.csv” and “lookup.csv”) are saved in a directory, here named dir. Once the package is installed, the following commands should be typed at the R
prompt:
> require(replacer)
> dir = 'C:/path/to/directory'
> replaceVals(dir) ## to update the data
Results will show on screen (Section “Screen Output Example” presents the output of a different data/lookup pair). A csv file similar to Out-Data is saved on dir.
Rw | a | b | uninvlvd |
---|---|---|---|
1 | aa | 7.2 | 0.6 |
2 | 17.6 | 0.1 | |
3 | cc | 1.6 | 1.6 |
4 | dd | 0.8 | 1.2 |
5 | ee | -14.0 | 5.0 |
6 | cc | 0.1 | |
7 | 1.6 | ||
8 | -5.0 | 3.3 | |
9 | mm | 1.6 | 9.6 |
10 | 5.2 | ||
11 | |||
12 | cc |
Rw | vars | oldVals | newVals | id | source |
---|---|---|---|---|---|
1 | a | ZZ | 2 | source1 | |
2 | a | CC | 8 | source2 | |
3 | a | cc | DD | 0 | source3 |
4 | a | mm | UU | 9 | source4 |
5 | a | cc | CA | 12 | source5 |
6 | a | YES | source6 | ||
7 | b | 1.607 | 1.1 | 0 | source7 |
8 | b | 1.607 | 8.8 | 7 | source8 |
9 | b | 999 | source9 |
Rw | a | b | uninvlvd |
---|---|---|---|
1 | aa | 7.2 | 0.6 |
2 | ZZ | 17.6 | 0.1 |
3 | DD | 1.1 | 1.6 |
4 | dd | 0.8 | 1.2 |
5 | ee | -14.0 | 5.0 |
6 | DD | 0.1 | |
7 | YES | 8.8 | |
8 | CC | -5.0 | 3.3 |
9 | UU | 1.1 | 9.6 |
10 | YES | 5.2 | |
11 | YES | 999.0 | |
12 | CA | 999.0 |
This small example covers the complete set of lookup requests with index present. Column “Rw” is here for convenience only. It should not be manually created.
The left-hand side table, In-Data, contains 3 columns of data. The two columns (“a” and “b”), involved in replacements, are of character respectively, of numeric data types and both contain missing, duplicates and unique values. The “uninvlvd” column in In-Data is not listed in lookup’s vars therefore, not involved in replacements and remains unchanged in Out-Data. Non-standard column source in Lookup is not standard hence, is automatically removed.
The Out-Data result requires a User-made lookup file with index; the replacement requests are as follows:
A User-made index may or, may not be necessary in lookup:
A User-made index is the safest data value replacement procedure; it requires little insight in the structure of the data and insures fast review of large datasets. When the full range of request types including splitting are present, the index should contain the types of values presented in the above example, in the lookup column id.
There could be several distinct sets of duplicated values in any data column. If the purpose is multiple replacements of any/all of these sets (or subsets of these sets) then, for each subset the replacement request should take only one row in lookup, having 0 as index value. Corresponding replacement values are at User’s discretion.
Same rules as above apply to multiple replacement of sets of missing values with two differences: 1) the corresponding index value is empty or NA, and 2) the generic value is restricted to a word or short sentence starting with any word root or adverb listed in “Definitions.” For generic values of numeric type, data type preservation (see Note 2) is not required (i.e. it is not required to enter 9999.0 instead of 9999 in the id row corresponding to missing values of double data type).
Further versions may bring more generic value customization if necessary.
To request simple replacements, whether having unique, duplicated or missing values in oldVals column, the index should contain the corresponding row number in the data file. Sorting lookup by vars would maintain focus on one data column at a time.
Standard columns oldVals and newVals in lookup, revert to the character data type upon completion since they contain combined numeric and character values. This behavior is normal.
In conclusion, a User-made index covers all types of replacements, including splits.
There are special circumstances when the index is absent in lookup:
There is no need for index when lookup contains only multiple replacement requests of missing and/or duplicated values (Section “Example Datasets”). An extension of this case regarding splitting on missing values is explained in the sub-section below.
This case requires a deeper insight in the structure of the data file; when used standalone, helper function whichDups()
, part of this utility, finds the duplicated values in all columns of the data file.
The requirements for an effective lookup table without index are:
rows in lookup must have the same succession the values set for replacement have in any involved data column
simple replacements: if values set for replacement within any involved data column are unique (either missing or not), simple replacements without index are possible
mixed simple and multiple replacements which do not undergo splitting on any involved data column: if any involved data column contains unique, duplicated and/or missing values and some or all of the duplicated and/or missing values are set for multiple replacements, replacements without index are possible on the unique values (missing or not)
splitting: not implemented for duplicated values. For missing values splitting is allowed (see Note 3)
Note 1 Whether the index is present or not in lookup, at the time of run the row order in the data file is identical to the row order in the data file at the completion of lookup. If necessary, an index named other than “id” could be built and preserved inside the dataset. Re-ordering the dataset by this index before run will retrieve the row order at the completion of lookup and will keep lookup relevant for future updates upon completion with new requests.
Note 2 User should preserve the data type between values set for replacement and their replacements (i.e. integer/double to integer/double etc.). If an error message similar to “Item 2 is type integer but the first item is type double. Please coerce …” appears on screen, it means that 1) the requirement was violated and that 2) the helper function con2fcoales()
, part of this utility, was unable to execute the necessary conversion. In such cases, investigation should start with the output messages and comments followed by data and the associated lookup.
Note 3 Due to their special type, splitting on missing values across data columns is allowed when index is absent, i.e. if one involved data column contains one missing value and other involved columns contain missing values set (or not) for multiple replacements, splitting is allowed and the unique missing value can be replaced with a non-generic replacement value.
There are two functions designed for direct use and set to display a series of messages and comments informing the User on the computational path taken and on findings along the way. While the messages can be turned off, the comments remain visible.
Single file replacements are processed with function replaceVals()
. When data and lookup files are named simply “data.csv” and “lookup.csv” the function only requires the path to the directory where these files are stored, written as length 1 quoted character, with forward “/” or, double backward “\\” slash and without end slash.
At start, this function performs a series of conformance checks on lookup, excludes non-standard columns then separates the data into involved and uninvolved columns, rejoining them on exit.
Batch file replacements are processed with function bReplace()
. When messages are set to TRUE it offers an almost full range of messages/comments for each data/lookup input pair and request type. During run, bReplace()
calls replaceVals()
function as many times as data/lookup pairs are on the list, displaying a named list of messages, comments, updated data and counts for each input pair.
Both functions save their updated data to the directory above, in csv file format.
Helper function sReplace()
(described in Manual) is the data replacement workhorse for this utility. When called by replaceVals()
, it firstly checks for index presence in lookup. Upon the result, the function moves along the branches of a decision tree:
The function starts by identifying duplicated and/or missing values within the involved columns as well as eventual splits on missing values.
If lookup requires multiple and simple replacements, the function separates lookup into 3 subsets: 1) of multiple replacements for duplicated values, for which it later creates an internal index, 2) multiple replacements for missing values for which an internal index is not necessary, and 3) the remainder subset containing unique values, including unique missing values, for which it also creates an internal index.
In case of splitting on duplicated values with no index the function stops with an error.
The function creates an internal index of row numbers corresponding to all elements of distinct subsets of duplicated values found within each involved data column and loops the function data.table::set()
for replacements.
As already mentioned, no index is created for multiple replacements of missing values as there is only one generic value per data column (this utility uses no information on missing data beside the data type. For a different approach on this subject the User is directed to data imputation literature).
The subset of missing values is then reshaped, and the columns are coalesced (please check data.table
Manual for these terms) with corresponding data columns, for each generic value present in lookup.
As stated above, simple replacements of unique values without User-made index are possible. Once the internal index is created, this subset is reshaped, joined with the data on index and then, the corresponding columns are coalesced.
The function subsets the lookup using the special index values 0 and/or NA (or empty). At maximum, 3 subsets are formed as above. The replacement process is similar with the process used for absent index with the difference that unique values already have the User-made index of row numbers.
In conclusion, whether single or batch file processing, all request types are processed in one run while the User monitors the internal workflow.
Presented below is an update of the “Chile” dataset from package carData
.
At the R
prompt, type or just copy/paste the commands:
help(Chile, package = "carData")
require(replacer)
dir = system.file("extdata", package = "replacer")
replaceVals(dir, 'chile.csv', 'chile_nadup.csv', save = FALSE)
reading data from: C:/R/R-4.1.0/library/replacer/extdata ...
completed reading data ...
checking standard columns in lookup ...
starting replacements ...
found no request for 1:1 replacements ...
searching for 1:many replacements ...
found request for multiple duplicated value replacments: creating index ...
replacing multiple duplicated values ...
found request for multiple missing value replacements: replacing ...
helper function has completed!
$` updated_chile_using_chile_nadup`
region population sex age education income statusquo vote
1: North 200000 Male 65 Primary 35000 1.00820 Y
2: North 200000 Male 29 Post-secondary 7500 -1.29617 N
3: North 200000 Female 38 Primary 15000 1.23072 Y
4: North 200000 Female 49 Primary 35000 -1.03163 N
5: North 200000 Female 23 Secondary 35000 -1.10496 N
---
2696: Metro Santiago 15000 Male 42 Primary 15000 -1.26247 N
2697: Metro Santiago 15000 Female 28 Primary 15000 1.32950 Y
2698: Metro Santiago 15000 Female 44 Primary 75000 1.42045 Y
2699: Metro Santiago 15000 Male 21 Secondary 75000 0.18315 PREFERS NO ANSWER
2700: Metro Santiago 15000 Male 20 Post-secondary 35000 1.38179 Y
$` multiple_dups_repl_counts`
vars oldVals newVals education population region sex
1: education P Primary 1107 NA NA NA
2: education PS Post-secondary 462 NA NA NA
3: education S Secondary 1120 NA NA NA
4: population 175000 200000 NA 140 NA NA
5: population 25000 50000 NA 360 NA NA
6: region C Central NA NA 600 NA
7: region M Metro Santiago NA NA 100 NA
8: region N North NA NA 322 NA
9: region S South NA NA 718 NA
10: region SA City of Santiago NA NA 960 NA
11: sex F Female NA NA NA 1379
12: sex M Male NA NA NA 1321
$` multiple_NAs_repl_counts`
region population sex age education income statusquo vote
0 0 0 1 11 98 17 168
This update replaced the abbreviations used in the source data with full names and complete words, as well as, updated the population totals on few regions with fictitious values. For exemplification, missing values in column vote were replaced by a short sentence using the word root “PRE” listed above.
The requests were processed through a lookup table without User-made index.
The first output block named “updated_chile_using_chile_nadup” displays the head and the tail of the updated data file.
To view the complete updated dataset on screen, type:
upData = replaceVals(dir, 'chile.csv', 'chile_nadup.csv', save = FALSE)[[1]]
View(upData)
Next block, named “multiple_dups_repl_counts,” displays a count of duplicated value updates, processed on each involved column, by value and replacement value. In this example, the left-hand side of this block contains the complete cases of lookup.
Finally, the third block named “multiple_NAs_repl_counts” shows counts of missing values replaced by generic values within respective involved columns. In other cases, unrequested replacements within any of the involved columns will be commented in the screen output (see Examples replaceVal()
).
All block names change upon input and request type; reading these names shows what was processed and what was not. In case of error, the displayed messages/comments should hint of where in the process the error occurred.
The datasets in this package cover situations encountered in practice. They contain fictitious values inserted for exemplification purpose with the exception of “chile” dataset, a copy of the “Chile” dataset from package carData
; some value replacements within the associated lookup files are also fictitious.
The batch-file example (function bReplace()
) includes data and associated lookup files presented below. Duplicated values and counts of missing values are shown for all these datasets. Column name indexing appears when several distinctive subsets of duplicated values are present in respective column.
|
|
||||||||||||||||||||
Associated lookup files:
“lookup_id.csv”: mixed simple/multiple replacements of unique, duplicated and missing data, similar to the first example. The file contains standard columns only.
“lookup_idsimple.csv”: simple replacements without 0 or NA values in “id.” The file contains standard columns only.
|
|
||||||||||||||||||||||||
Associated lookup file:
“chile_id.csv”: splits of multiple replacements of duplicated values. Also, multiple replacements of missing data. The file includes non-standard column “source” .
|
|
||||||||||||||||||||||||
Associated lookup file:
“lookup.csv”: mixed simple and multiple replacements of unique, duplicated and missing values. Split on duplicated value “cc” as illustration of the importance of row succession in replacement. The file contains standard columns only.
|
|
||||||||||||
Associated lookup file:
“lookup_unique”: split across columns on missing values. The file contains standard columns only.
“lookupDUP,” “lookupNA”: multiple replacements only of duplicated or missing values in “data_id” dataset. Both files contain standard columns only.
“chile_nadup”: multiple replacements only of duplicated and missing values in “chile” dataset. The file contains non-standard column “source” .
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19043)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] replacer_1.0.1 kableExtra_1.3.4 knitr_1.33 data.table_1.14.0
#>
#> loaded via a namespace (and not attached):
#> [1] rstudioapi_0.13 xml2_1.3.2 magrittr_2.0.1 munsell_0.5.0
#> [5] rvest_1.0.1 viridisLite_0.4.0 colorspace_2.0-2 R6_2.5.1
#> [9] rlang_0.4.11 fastmap_1.1.0 highr_0.9 stringr_1.4.0
#> [13] httr_1.4.2 tools_4.1.0 webshot_0.5.2 xfun_0.25
#> [17] jquerylib_0.1.4 systemfonts_1.0.2 htmltools_0.5.2 yaml_2.2.1
#> [21] digest_0.6.27 lifecycle_1.0.0 formatR_1.11 sass_0.4.0
#> [25] glue_1.4.2 evaluate_0.14 rmarkdown_2.10 stringi_1.7.4
#> [29] compiler_4.1.0 bslib_0.2.5.1 scales_1.1.1 svglite_2.0.0
#> [33] jsonlite_1.7.2