Introduction to asciiSetupReader

Jacob Kaplan

2021-02-03

Some (usually older) data sets are only available as fixed-width ASCII files (.txt or .dat) that come with an .sps (SPSS) or .sas (SAS) setup file telling the software how to read the data. This package lets you read in the data if you have both the fixed-width file and its accompanying setup file. The parameters data and setup_file are the only ones required, though three optional parameters allow you to customize the results.

data - A string containing the name of the data file

setup_file - A string containing the name of the setup file

Both files must be in your working directory or the string must contain the path to the file. Below is an example of reading in the example dataset - the original data and setup files can be found here.

Please note that I am only using system.file() here so that the vignette builds with the package even when it is not built on my own computer. You will not need it when you call the function yourself. Instead you’d simply input data = "example_data.zip" and setup_file = "example_setup.sps". The data file does not have to be in a zip folder; it is only zipped here to reduce the size of this package. In most cases it will be a .dat or a .txt file.

data <- system.file("extdata", "example_data.zip",
                    package = "asciiSetupReader")
setup_file <- system.file("extdata", "example_setup.sps",
                          package = "asciiSetupReader")

example <- asciiSetupReader::read_ascii_setup(data = data,
                                              setup_file = setup_file)
example[1:6, 1:4] # Look at first 6 rows and first 4 columns
##   IDENTIFIER_CODE NUMERIC_STATE_CODE ORI_CODE             GROUP
## 1 SHR master file            Alabama  AL00112 Cit 50,000-99,999
## 2 SHR master file            Alabama  AL00112 Cit 50,000-99,999
## 3 SHR master file            Alabama  AL00112 Cit 50,000-99,999
## 4 SHR master file            Arizona  AZ00189       Cit < 2,500
## 5 SHR master file            Arizona  AZ00189       Cit < 2,500
## 6 SHR master file            Arizona  AZ00189       Cit < 2,500
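
Outside of the vignette, where the files sit in a folder of your own, the call might look like the sketch below. The folder name raw_data and both file names are hypothetical placeholders, not files shipped with the package, and the guard around the call is only there so the snippet does nothing harmful when the package or the files are missing.

```r
# Hypothetical locations; replace with wherever your files actually live.
data_path  <- file.path("raw_data", "example_data.dat")
setup_path <- file.path("raw_data", "example_setup.sps")

# Only attempt the read when the package and both files are available.
can_read <- requireNamespace("asciiSetupReader", quietly = TRUE) &&
  file.exists(data_path) &&
  file.exists(setup_path)

if (can_read) {
  example <- asciiSetupReader::read_ascii_setup(data = data_path,
                                                setup_file = setup_path)
  print(head(example))
} else {
  message("Package or files not found; nothing was read.")
}
```

Building the paths with file.path() keeps the same code working across operating systems. If both files are already in your working directory, the bare file names are enough.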

There are three optional parameters: use_value_labels, use_clean_names, and select_columns.

use_value_labels

Fixed-width text files are designed to be as compact as possible. One way of doing this is to have letters or numbers represent values. For example, instead of writing “male” or “female” in a column about gender, the file will store “0” or “1” (or “M” and “F”). The setup file gives the actual value for each of these representations. When the parameter use_value_labels is TRUE (which it is by default), the function returns the value labels; otherwise it returns only the representations. Applying value labels is the most time-consuming part of the function, so if you have a very large dataset but are interested in only a few variables, it may be wise to set it to FALSE (or to use the parameter select_columns to get only those columns).

example <- asciiSetupReader::read_ascii_setup(data = data,
                                              setup_file = setup_file,
                                              use_value_labels = FALSE)
example[1:6, 1:4] # Look at first 6 rows and first 4 columns
##   IDENTIFIER_CODE NUMERIC_STATE_CODE ORI_CODE GROUP
## 1               6                  1  AL00112     3
## 2               6                  1  AL00112     3
## 3               6                  1  AL00112     3
## 4               6                  2  AZ00189     7
## 5               6                  2  AZ00189     7
## 6               6                  2  AZ00189     7
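
To make the idea of representations and value labels concrete, here is a minimal base R sketch. It is not how the package works internally; it only illustrates the kind of mapping a setup file describes. The toy file, the column widths, the column names, and the code-to-label lookups are all made up for this illustration.

```r
# Toy fixed-width data: characters 1-2 hold a state code, character 3 a sex code.
raw_lines <- c("011", "010", "021")
tmp <- tempfile(fileext = ".txt")
writeLines(raw_lines, tmp)

# Column widths and names of the kind a setup file would normally supply
# (hypothetical values for this example).
toy <- read.fwf(tmp,
                widths     = c(2, 1),
                col.names  = c("STATE", "SEX"),
                colClasses = "character")
unlink(tmp)

# Hypothetical value labels: each stored code maps to a readable value.
state_labels <- c("01" = "Alabama", "02" = "Arizona")
sex_labels   <- c("0" = "Female", "1" = "Male")

# Swap each code for its label by looking it up in the named vector.
labelled <- toy
labelled$STATE <- unname(state_labels[toy$STATE])
labelled$SEX   <- unname(sex_labels[toy$SEX])

print(toy)      # codes only, comparable to use_value_labels = FALSE
print(labelled) # labels applied, comparable to use_value_labels = TRUE
```

The lookup has to touch every row of every labelled column, which gives some intuition for why applying labels costs time on large files.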

use_clean_names

Column names work the same way as values: just as a column's values have both a representation and a value label, each column may have a non-descriptive name (e.g. V1, V2) and a descriptive one (e.g. CITY, GENDER). When use_clean_names is TRUE (which it is by default), the descriptive name is given; otherwise the non-descriptive name is given.

example <- asciiSetupReader::read_ascii_setup(data = data,
                                              setup_file = setup_file,
                                              use_clean_names = FALSE)
example[1:6, 1:4] # Look at first 6 rows and first 4 columns
##                V1      V2      V3                V4
## 1 SHR master file Alabama AL00112 Cit 50,000-99,999
## 2 SHR master file Alabama AL00112 Cit 50,000-99,999
## 3 SHR master file Alabama AL00112 Cit 50,000-99,999
## 4 SHR master file Arizona AZ00189       Cit < 2,500
## 5 SHR master file Arizona AZ00189       Cit < 2,500
## 6 SHR master file Arizona AZ00189       Cit < 2,500
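
The relationship between the two kinds of names can be sketched in base R as a simple lookup. This is only an illustration of the idea, not the package's own code. The two name pairs are taken from the example outputs in this vignette, while the toy data frame and the name_lookup vector are assumptions made for the sketch.

```r
# Toy data frame that uses the non-descriptive names.
toy <- data.frame(V1 = c("SHR master file", "SHR master file"),
                  V2 = c("Alabama", "Arizona"),
                  stringsAsFactors = FALSE)

# Lookup from non-descriptive to descriptive names, the kind of
# information a setup file carries (hypothetical object for illustration).
name_lookup <- c(V1 = "IDENTIFIER_CODE", V2 = "NUMERIC_STATE_CODE")

# Rename every column by looking up its current name.
clean <- toy
names(clean) <- unname(name_lookup[names(toy)])

print(names(toy))
print(names(clean))
```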

select_columns

This parameter allows you to return only the specific columns you want. It is very useful when you want only part of a large file. It accepts three types of input: column numbers, the non-descriptive column names, or the descriptive column names. You can use only one input type at a time; you cannot mix them. To get the column names and numbers, consult the documentation.

This gets only the first two columns of data and specifies the columns by number.

example <- asciiSetupReader::read_ascii_setup(data = data,
                                              setup_file = setup_file,
                                              select_columns = 1:2) # Gets only the first 2 columns
head(example)
##   IDENTIFIER_CODE NUMERIC_STATE_CODE
## 1 SHR master file            Alabama
## 2 SHR master file            Alabama
## 3 SHR master file            Alabama
## 4 SHR master file            Arizona
## 5 SHR master file            Arizona
## 6 SHR master file            Arizona

This gets only the first two columns of data and specifies the columns by descriptive names.

example <- asciiSetupReader::read_ascii_setup(data = data,
                                              setup_file = setup_file,
                                              select_columns = c("IDENTIFIER_CODE", "NUMERIC_STATE_CODE")) # Gets only the first 2 columns
head(example)
##   IDENTIFIER_CODE NUMERIC_STATE_CODE
## 1 SHR master file            Alabama
## 2 SHR master file            Alabama
## 3 SHR master file            Alabama
## 4 SHR master file            Arizona
## 5 SHR master file            Arizona
## 6 SHR master file            Arizona

This gets only the first column of data and specifies the column by non-descriptive names.

example <- asciiSetupReader::read_ascii_setup(data = data,
                                              setup_file = setup_file,
                                              select_columns = "V1") # Gets only the first column
head(example)
##   IDENTIFIER_CODE
## 1 SHR master file
## 2 SHR master file
## 3 SHR master file
## 4 SHR master file
## 5 SHR master file
## 6 SHR master file
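
The optional parameters can also be combined. The sketch below selects two columns by their non-descriptive names and skips the value labels, which is the combination suggested earlier for large files where only a few variables matter. It assumes the asciiSetupReader package is installed with its bundled example files; the requireNamespace() guard and the object name small are additions for this illustration.

```r
if (requireNamespace("asciiSetupReader", quietly = TRUE)) {
  data <- system.file("extdata", "example_data.zip",
                      package = "asciiSetupReader")
  setup_file <- system.file("extdata", "example_setup.sps",
                            package = "asciiSetupReader")

  # Read only the first two columns and leave the stored codes unlabelled.
  small <- asciiSetupReader::read_ascii_setup(data = data,
                                              setup_file = setup_file,
                                              use_value_labels = FALSE,
                                              select_columns = c("V1", "V2"))
  print(head(small))
} else {
  # Package not installed; nothing to read.
  small <- NULL
}
```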