Variable selection in DEA with the adea method

Fernando Fernandez-Palacin1

Manuel Munoz-Marquez2

2022-02-09

Introduction

Variable selection in DEA requires careful attention before the results of an analysis are used in a real case, because those results can change significantly depending on the variables included in the model. Variable selection is therefore a key step in every DEA application.

adea provides a measure, called load, of the contribution of a variable to a DEA model. In the ideal case, when all variables contribute equally, all loads are 1. For example, an output variable with a load of 0.75 contributes 75% of the average contribution of all outputs. A load lower than 0.6 means that the variable's contribution to the DEA model is negligible.
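The interpretation of loads described above can be sketched in a few lines of base R. The load values and variable names below are hypothetical and chosen only for illustration; they are not computed by adea or taken from any data set.

```r
# Hypothetical loads for three output variables (illustrative values only).
# They are chosen so that their average is 1, as in the description above.
loads <- c(out1 = 0.75, out2 = 1.70, out3 = 0.55)

# A load of 0.75 means 75% of the average contribution.
pct_of_average <- 100 * loads

# Threshold quoted in the text: loads below 0.6 are considered negligible.
threshold <- 0.6
negligible <- names(loads)[loads < threshold]

print(pct_of_average)
print(negligible)
```

Here only out3 falls below the threshold, so it would be the candidate for removal.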

For more information see (Fernandez-Palacin, Lopez-Sanchez, and Munoz-Marquez 2018) and (Villanueva-Cantillo and Munoz-Marquez 2021).

Let’s load the tokyo_libraries dataset and look at its first rows:

library(adea)
data(tokyo_libraries)
head(tokyo_libraries)
#>   Area.I1 Books.I2 Staff.I3 Populations.I4 Regist.O1 Borrow.O2
#> 1   2.249  163.523       26         49.196     5.561   105.321
#> 2   4.617  338.671       30         78.599    18.106   314.682
#> 3   3.873  281.655       51        176.381    16.498   542.349
#> 4   5.541  400.993       78        189.397    30.810   847.872
#> 5  11.381  363.116       69        192.235    57.279   758.704
#> 6  10.086  541.658      114        194.091    66.137  1438.746

Stepwise variable selection

Two stepwise variable selection functions are provided. The first one drops variables one at a time, giving a set of nested models. The following code sets up the input and output variables and makes the call:

input <- tokyo_libraries[, 1:4]
output <- tokyo_libraries[, 5:6]
adea_hierarchical(input, output)
#>       Load #Efficients #Variables #Inputs #Outputs
#> 6 inoutput           6          6       4        2
#> 5 inoutput           6          5       3        2
#> 4 inoutput           4          4       3        1
#> 3 inoutput           2          3       2        1
#> 2 inoutput           1          2       1        1
#> 1 inoutput           0          1       0        0
#>                                        Inputs              Outputs
#> 6 Area.I1, Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 5          Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 4          Books.I2, Staff.I3, Populations.I4            Borrow.O2
#> 3                    Books.I2, Populations.I4            Borrow.O2
#> 2                                    Books.I2            Borrow.O2
#> 1

The load of the first model, which includes all six variables, is 0.455467. This is below the 0.6 threshold for a non-negligible contribution, so Area.I1 can be removed from the model.

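When a variable is removed, one would expect the loads of the remaining variables to rise, but after the second model this does not happen: the load falls from 0.990164 for the five-variable model to 0.903174 for the four-variable one. The third model is therefore poorer than the second, and there is no statistical reason to select it.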
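This comparison can be checked with a small base R sketch. The loads are copied from the outputs shown in this document (0.455467 and 0.990164 from adea_parametric, 0.903174 from summary); the vector and variable names are illustrative and not part of adea.

```r
# Model loads reported in this document, named by the number of variables.
model_load <- c("6" = 0.455467, "5" = 0.990164, "4" = 0.903174)

# Change in load each time one more variable is dropped.
n <- length(model_load)
change <- model_load[-1] - model_load[-n]

# Models whose load is lower than that of the preceding, larger model.
worse <- names(model_load)[-1][change < 0]

print(change)
print(worse)
```

Only the four-variable model shows a drop in load relative to its predecessor.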

To avoid this, a second stepwise selection function is provided. The call is:

adea_parametric(input, output)
#>       Load #Efficients #Variables #Inputs #Outputs
#> 6 0.455467           6          6       4        2
#> 5 0.990164           6          5       3        2
#> 2 1.000000           1          2       1        1
#>                                        Inputs              Outputs
#> 6 Area.I1, Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 5          Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 2                                    Books.I2            Borrow.O2

In both cases, all variables are candidates for removal. The load.orientation parameter selects which variables are included in the load analysis: input for input variables only, output for output variables only, and inoutput (the default) for all variables. The next call considers only output variables as candidates for removal:

adea_parametric(input, output, load.orientation = 'output')
#>   Load #Efficients #Variables #Inputs #Outputs
#> 6    1           6          6       4        2
#> 5    1           4          5       4        1
#>                                        Inputs              Outputs
#> 6 Area.I1, Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 5 Area.I1, Books.I2, Staff.I3, Populations.I4            Borrow.O2
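The effect of load.orientation on the candidate set can be illustrated in base R. This is only a sketch of the idea, not the package's implementation: the helper candidate() is hypothetical, and it assumes that the variable proposed for removal is the one with the lowest load within the selected group. The loads are those that summary reports for the four-variable model elsewhere in this document.

```r
# Loads of the four-variable model, copied from its summary output.
loads <- list(
  input  = c(Books.I2 = 1.193651, Staff.I3 = 0.9031744, Populations.I4 = 0.9031744),
  output = c(Borrow.O2 = 1)
)

# Hypothetical helper (not part of adea): restrict the candidates according to
# the orientation and return the name of the lowest-load variable.
candidate <- function(loads, load.orientation = c("inoutput", "input", "output")) {
  load.orientation <- match.arg(load.orientation)
  pool <- switch(load.orientation,
    input    = loads$input,
    output   = loads$output,
    inoutput = c(loads$input, loads$output)
  )
  # which.min returns the first minimum when there are ties.
  names(pool)[which.min(pool)]
}

print(candidate(loads))            # all variables are candidates
print(candidate(loads, "output"))  # only outputs are candidates
```

With all variables as candidates the sketch picks Staff.I3, which is the variable missing from the three-variable model in the adea_hierarchical output above.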

Both adea_hierarchical and adea_parametric return a list, called models, containing all the computed models. A model can be accessed as follows:

m <- adea_hierarchical(input, output)
m4 <- m$models[[4]]
m4
#>         1         2         3         4         5         6         7         8 
#> 0.3026132 0.6425505 0.5733000 0.7164871 0.6733832 1.0000000 0.6967419 0.4476942 
#>         9        10        11        12        13        14        15        16 
#> 1.0000000 0.7051438 0.5336592 0.7583527 0.5915395 0.7215430 0.7832606 0.5822710 
#>        17        18        19        20        21        22        23 
#> 0.8451129 0.7867065 1.0000000 0.8485716 0.7285929 0.7849437 1.0000000

where the number in square brackets is the total number of variables in the model.

By default, printing an adea model shows only the efficiencies. The summary function gives more detailed output:

summary(m4)
#> Model name: 
#> Orientation is input
#> Inputs: Books.I2 Staff.I3 Populations.I4 
#> Outputs: Borrow.O2 
#> Input loads:  1.193651 0.9031744 0.9031744 
#> Output loads:  1 
#> Model load: 0.903174350658053
#> #Efficients: 4
#> Efficiencies:
#>         1         2         3         4         5         6         7         8 
#> 0.3026132 0.6425505 0.5733000 0.7164871 0.6733832 1.0000000 0.6967419 0.4476942 
#>         9        10        11        12        13        14        15        16 
#> 1.0000000 0.7051438 0.5336592 0.7583527 0.5915395 0.7215430 0.7832606 0.5822710 
#>        17        18        19        20        21        22        23 
#> 0.8451129 0.7867065 1.0000000 0.8485716 0.7285929 0.7849437 1.0000000 
#> Summary of efficiencies:
#>      Mean        sd      Min.   1st Qu.    Median   3rd Qu.      Max. 
#> 0.7270638 0.1793772 0.3026132 0.6170450 0.7215430 0.8159097 1.0000000
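Several figures in this summary can be cross-checked with base R. The sketch below copies the printed values by hand, so it needs no adea objects; the variable names are illustrative. That the model load equals the smallest variable load is an observation about this particular output, not a statement about the package internals.

```r
# Values copied from the summary(m4) output above.
input_loads  <- c(Books.I2 = 1.193651, Staff.I3 = 0.9031744, Populations.I4 = 0.9031744)
output_loads <- c(Borrow.O2 = 1)
eff <- c(0.3026132, 0.6425505, 0.5733000, 0.7164871, 0.6733832, 1.0000000,
         0.6967419, 0.4476942, 1.0000000, 0.7051438, 0.5336592, 0.7583527,
         0.5915395, 0.7215430, 0.7832606, 0.5822710, 0.8451129, 0.7867065,
         1.0000000, 0.8485716, 0.7285929, 0.7849437, 1.0000000)

# The input loads average to 1, up to rounding of the printed values.
mean_input_load <- mean(input_loads)

# In this output the model load matches the smallest variable load.
model_load <- min(c(input_loads, output_loads))

# Efficient units have efficiency 1 (tolerance allows for printed rounding).
n_efficient <- sum(abs(eff - 1) < 1e-6)

# Mean efficiency, to compare with the "Summary of efficiencies" line.
eff_mean <- mean(eff)

print(c(mean_input_load = mean_input_load, model_load = model_load))
print(c(n_efficient = n_efficient, eff_mean = eff_mean))
```

The count of efficient units and the mean efficiency agree with the "#Efficients" and "Mean" entries printed by summary.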

References

Fernandez-Palacin, Fernando, María Auxiliadora Lopez-Sanchez, and Manuel Munoz-Marquez. 2018. “Stepwise selection of variables in DEA using contribution loads.” Pesquisa Operacional 38 (1): 31–52. http://dx.doi.org/10.1590/0101-7438.2018.038.01.0031.

Villanueva-Cantillo, Jeyms, and Manuel Munoz-Marquez. 2021. “Methodology for Calculating Critical Values of Relevance Measures in Variable Selection Methods in Data Envelopment Analysis.” European Journal of Operational Research 290 (2): 657–70. https://doi.org/10.1016/j.ejor.2020.08.021.


  1. Universidad de Cádiz

  2. Universidad de Cádiz