Simulation Options

Krystian Igras

2021-09-23

The DataFakeR package lets you customize each step of its workflow by setting the proper options with the set_faker_opts function (and the option-specific helper methods).

All configurable options, together with their default values, are stored in the default_faker_opts object.

str(default_faker_opts, max.level = 1)
#> List of 27
#>  $ opt_pull_character             :List of 5
#>  $ opt_pull_numeric               :List of 5
#>  $ opt_pull_integer               :List of 5
#>  $ opt_pull_logical               :List of 2
#>  $ opt_pull_date                  :List of 3
#>  $ opt_pull_table                 :List of 1
#>  $ opt_default_character          :List of 7
#>  $ opt_simul_spec_character       :List of 1
#>  $ opt_simul_restricted_character :List of 2
#>  $ opt_simul_default_fun_character:function (n, not_null, unique, default, nchar, type, na_ratio, levels_ratio, 
#>     ...)  
#>  $ opt_default_numeric            :List of 8
#>  $ opt_simul_spec_numeric         :List of 1
#>  $ opt_simul_restricted_numeric   :List of 3
#>  $ opt_simul_default_fun_numeric  :function (n, not_null, unique, default, type, na_ratio, levels_ratio, ...)  
#>  $ opt_default_integer            :List of 6
#>  $ opt_simul_spec_integer         :List of 1
#>  $ opt_simul_restricted_integer   :List of 3
#>  $ opt_simul_default_fun_integer  :function (n, not_null, unique, default, type, na_ratio, levels_ratio, ...)  
#>  $ opt_default_logical            :List of 6
#>  $ opt_simul_spec_logical         :List of 1
#>  $ opt_simul_restricted_logical   :List of 1
#>  $ opt_simul_default_fun_logical  :function (n, not_null, unique, default, type, na_ratio, levels_ratio, ...)  
#>  $ opt_default_date               :List of 9
#>  $ opt_simul_spec_date            :List of 1
#>  $ opt_simul_restricted_date      :List of 2
#>  $ opt_simul_default_fun_date     :function (n, not_null, unique, default, type, min_date, max_date, format, 
#>     na_ratio, levels_ratio, ...)  
#>  $ opt_default_table              :List of 1

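Customizable options can be divided into three main groups: pulling database schema configuration, default column-type and table parameters, and column-type simulation methods configuration.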

Pulling database schema configuration

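All the parameters in set_faker_opts prefixed with opt_pull (opt_pull_character, opt_pull_numeric, opt_pull_integer, opt_pull_logical, opt_pull_date and opt_pull_table) belong to this group.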

See Sourcing structure from database for more details.

Default column-type parameters

Looking at a single column specification in the configuration YAML file:

columns:
  column_a1:
    type: char(8)
    not_null: true
    unique: true
    ...

you may find a list of parameters attached to each column. These parameters are passed to each simulation method and can be used to shape the resulting column.

When the number of columns is large, defining such parameters for every column in the configuration file may be inconvenient. To make configuration easier, you may define default parameters for each column type with the opt_default_<column-type> method.

Simply put:

my_opts <- set_faker_opts(
  opt_default_<column-type> = opt_default_<column-type>(...)
)

The default parameters in DataFakeR can be accessed by default_faker_opts$opt_default_<column-type>.
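
As a concrete instance of the template above, the sketch below overrides two defaults for character columns. It assumes that opt_default_character() accepts the same parameter names that appear in default_faker_opts$opt_default_character (nchar and na_ratio here); the values 20 and 0.1 are arbitrary. The call is guarded so that the snippet is simply skipped when DataFakeR is not installed.

```r
# Assumption: opt_default_character() takes the parameter names listed in
# default_faker_opts$opt_default_character (e.g. nchar, na_ratio).
my_opts <- NULL
if (requireNamespace("DataFakeR", quietly = TRUE)) {
  my_opts <- DataFakeR::set_faker_opts(
    opt_default_character = DataFakeR::opt_default_character(
      nchar = 20,     # arbitrary illustrative value
      na_ratio = 0.1  # arbitrary illustrative value
    )
  )
  # Inspect the resulting character defaults
  str(my_opts$opt_default_character)
}
```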

For example, for character type columns we have:

default_faker_opts$opt_default_character
#> $regexp
#> [1] "text|char|factor"
#> 
#> $nchar
#> [1] 10
#> 
#> $not_null
#> [1] FALSE
#> 
#> $unique
#> [1] FALSE
#> 
#> $default
#> [1] ""
#> 
#> $na_ratio
#> [1] 0.05
#> 
#> $levels_ratio
#> [1] 1

That means that whenever a character column is simulated and these parameters are not defined in the schema YAML file, the simulation methods receive nchar = 10, not_null = FALSE, unique = FALSE, default = "", na_ratio = 0.05 and levels_ratio = 1.
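
The fallback behaviour described here can be pictured with a small base R sketch. It is only an illustration of the idea (values set for a column win, everything else comes from the type defaults) and not DataFakeR's internal code; the column_spec list is a hypothetical stand-in for a column read from the YAML file.

```r
# Type-level defaults, copied from default_faker_opts$opt_default_character.
# regexp is left out on purpose: it drives type mapping and is not passed
# to simulation methods.
type_defaults <- list(
  nchar = 10, not_null = FALSE, unique = FALSE,
  default = "", na_ratio = 0.05, levels_ratio = 1
)

# Hypothetical column that sets only two parameters in the schema YAML.
column_spec <- list(not_null = TRUE, unique = TRUE)

# Column-level values override the defaults; the rest falls back.
effective <- modifyList(type_defaults, column_spec)
str(effective)
```

Here effective keeps not_null and unique from the column, while nchar, default, na_ratio and levels_ratio come from the defaults.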

Column type mapping

Looking at the default parameters list, you will find a parameter named regexp. It is an exception: it is not passed to simulation methods, but maps the column type defined in the configuration YAML file to the target R type.

For example, default_faker_opts$opt_default_character$regexp = "text|char|factor" means that whenever a column type matches the regular expression "text|char|factor", the column is treated in R as a character class one.

You may modify this regular expression if you want to extend the mapping between source column types and the target R column class.
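
To get a feeling for what such a pattern matches, it can be tried out with base R's grepl. This is a sketch that assumes plain regular expression matching; how DataFakeR applies the pattern internally (for example regarding letter case) is not covered here, and the type names below, including "string", are only illustrative.

```r
# Default pattern for character columns, as shown above
char_regexp <- "text|char|factor"

# Illustrative column types as they could appear in a schema file
schema_types <- c("char(8)", "varchar(255)", "text", "integer", "string")

# Which types would be mapped to the character class?
is_character <- grepl(char_regexp, schema_types)
setNames(is_character, schema_types)

# An extended pattern that additionally covers a "string" type
extended_regexp <- "text|char|factor|string"
is_character_ext <- grepl(extended_regexp, schema_types)
setNames(is_character_ext, schema_types)
```

With the default pattern, "string" is not matched; with the extended one it is, while "integer" stays unmatched in both cases.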

Default table parameters

When simulating data, apart from column-specific parameters you may also want to pass parameters to each table. One example is the number of rows that the resulting table should contain.

Such parameters are configured with the opt_default_table method. Each parameter specified there is attached to every table and used in the simulation process.

Each parameter passed to opt_default_table should be either a constant value or a function that iterates over all the tables and returns the proper parameter value for each one.

So, specifying:

set_faker_opts(opt_default_table = opt_default_table(nrows = 10))

will attach nrows = 10 to each table, and as a result (based on DataFakeR functionality) each simulated table will have 10 rows.

Setting (this is the default):

set_faker_opts(opt_default_table = opt_default_table(nrows = nrows_simul_constant(10)))

will attach nrows = 10 to each table for which nrows was not specified in the configuration.
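
The idea of a function-valued parameter that walks over all tables can be sketched in plain R. The helper name fill_missing_nrows and the shape of the tables list are hypothetical and chosen only for illustration; they are not DataFakeR's interface, so consult the definition of nrows_simul_constant for the real one.

```r
# Hypothetical helper, NOT DataFakeR's implementation: returns a function that
# iterates over a named list of table specs and sets nrows where it is missing.
fill_missing_nrows <- function(n) {
  function(tables) {
    lapply(tables, function(table_spec) {
      if (is.null(table_spec$nrows)) {
        table_spec$nrows <- n
      }
      table_spec
    })
  }
}

# Toy table specs: one with nrows given in the configuration, one without.
tables <- list(
  table_a = list(nrows = 25),
  table_b = list()
)

resolved <- fill_missing_nrows(10)(tables)
nrows_per_table <- vapply(resolved, function(table_spec) table_spec$nrows, numeric(1))
nrows_per_table
```

In this toy setup table_a keeps its configured 25 rows and table_b receives the fallback value 10.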

DataFakeR also provides a second method for defining the number of rows, nrows_simul_ratio, which calculates the number of rows from a provided ratio and the total number of rows in all tables together. For example, specifying nrows = nrows_simul_ratio(0.1, 100) will result in:

To understand how to create custom methods, please check the definitions of nrows_simul_constant and nrows_simul_ratio.

Note: The only supported opt_default_table parameter is nrows. In future releases, the option to set up custom parameters and actively use them in the simulation process will be enabled.

Column-type simulation methods configuration

The last group of configuration parameters lets you customize simulation methods. As presented on the simulation methods page, there are four types of simulation:

  1. Deterministic (formula or constraint-based) simulation.
  2. Special method simulation.
  3. Restricted simulation.
  4. Default simulation.

All simulation method types (except the deterministic one) can be configured with set_faker_opts using:

set_faker_opts(
  opt_simul_spec_<column-type> = opt_simul_spec_<column-type>(
    <spec-method-name> = <spec-function>
  )
)
set_faker_opts(
  opt_simul_restricted_<column-type> = opt_simul_restricted_<column-type>(
    <restricted-method-name> = <restricted-function>
  )
)
set_faker_opts(
  opt_simul_default_fun_<column-type> = <default-function>
)

Examples showing how to define custom methods, and what each method type means, are presented on the simulation methods page.
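
As a rough illustration of the last template above, the sketch below defines a custom default function for logical columns with the same argument list as opt_simul_default_fun_logical in default_faker_opts. The meaning given to n, not_null and na_ratio (number of values, whether missing values are forbidden, share of missing values) is an assumption made for this sketch; unique, default, type and levels_ratio are accepted but ignored.

```r
# Sketch of a custom default simulator for logical columns. The argument list
# mirrors opt_simul_default_fun_logical; the parameter semantics are assumed.
my_default_logical <- function(n, not_null, unique, default, type,
                               na_ratio, levels_ratio, ...) {
  values <- sample(c(TRUE, FALSE), n, replace = TRUE)
  if (!isTRUE(not_null) && n > 0) {
    # Replace a fixed share of the values with NA
    n_missing <- floor(n * na_ratio)
    values[sample.int(n, n_missing)] <- NA
  }
  values
}

# Stand-alone call with illustrative argument values
set.seed(1)
simulated <- my_default_logical(
  n = 20, not_null = FALSE, unique = FALSE, default = NA,
  type = "logical", na_ratio = 0.25, levels_ratio = 1
)
sum(is.na(simulated))

# Following the template above, the function would be registered with:
# set_faker_opts(opt_simul_default_fun_logical = my_default_logical)
```

With n = 20 and na_ratio = 0.25, exactly floor(20 * 0.25) = 5 values are set to NA, regardless of the seed.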