Creating FFTrees with FFTrees()

Nathaniel Phillips

2022-07-18

The FFTrees() function is at the heart of the FFTrees package. It takes a training dataset as an argument and generates several fast-and-frugal trees (FFTs) that attempt to classify cases into one of two classes (True or False) based on cues (also known as features).

Example: heartdisease

We’ll create FFTrees for heart disease diagnosis data. The full dataset is stored as heartdisease. For modelling purposes, I’ve split the data into a training dataframe (heart.train) and a test dataframe (heart.test). Here’s how they look:

# Training data
head(heart.train)
## # A tibble: 6 × 14
##   diagnosis   age   sex cp    trestbps  chol   fbs restecg thalach exang oldpeak
##   <lgl>     <dbl> <dbl> <chr>    <dbl> <dbl> <dbl> <chr>     <dbl> <dbl>   <dbl>
## 1 FALSE        63     1 ta         145   233     1 hypert…     150     0     2.3
## 2 TRUE         67     1 a          160   286     0 hypert…     108     1     1.5
## 3 TRUE         67     1 a          120   229     0 hypert…     129     1     2.6
## 4 FALSE        37     1 np         130   250     0 normal      187     0     3.5
## 5 FALSE        41     0 aa         130   204     0 hypert…     172     0     1.4
## 6 FALSE        56     1 aa         120   236     0 normal      178     0     0.8
## # … with 3 more variables: slope <chr>, ca <dbl>, thal <chr>
# Test data
head(heart.test)

The critical dependent variable is diagnosis, which indicates whether a patient has heart disease (diagnosis = TRUE) or not (diagnosis = FALSE). The other variables in the dataset (e.g., sex, age, and several biological measurements) will be used as predictors (also known as cues).
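
Because diagnosis is stored as a logical vector, a quick way to check the base rate of heart disease is to take its mean (a one-line sketch; output omitted):

# Proportion of patients in the full dataset with heart disease
mean(heartdisease$diagnosis)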

Create trees with FFTrees()

We will train the FFTs on heart.train and test their prediction performance on heart.test. Note that you can also automate the training / test split using the train.p argument in FFTrees(), which randomly assigns the specified proportion of the original data to the training set.
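
For example, a call like the following would build trees from a random 50% of heartdisease and reserve the rest for testing (a sketch; heart.split.fft is a hypothetical name, and because the split is random, results will vary across runs):

# Let FFTrees() create the training / test split itself
set.seed(100)  # make the random split reproducible
heart.split.fft <- FFTrees(formula = diagnosis ~ .,
                           data = heartdisease,
                           train.p = .5)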

To create a set of FFTs, use FFTrees(). We’ll create a new FFTrees object called heart.fft, specifying diagnosis as the (binary) dependent variable and including all other variables as predictors with formula = diagnosis ~ .

# Create an FFTrees object called heart.fft predicting diagnosis
heart.fft <- FFTrees(formula = diagnosis ~ .,
                    data = heart.train,
                    data.test = heart.test)

Elements of an FFTrees object

FFTrees() returns an object of class FFTrees. An FFTrees object contains many elements; here are their names:

# Print the names of the elements of an FFTrees object
names(heart.fft)
## [1] "criterion_name" "cue_names"      "formula"        "trees"         
## [5] "data"           "params"         "competition"    "cues"

You can view basic information about an FFTrees object by printing its name. The default tree construction algorithm, ifan, creates multiple trees with different exit structures. When you print an FFTrees object, you will see details about the tree with the highest value of the goal statistic, which by default is weighted accuracy (wacc):

# Print the object, with details about the tree with the best training wacc values
heart.fft
## FFTrees 
## - Trees: 7 fast-and-frugal trees predicting diagnosis
## - Outcome costs: [hi = 0, mi = 1, fa = 1, cr = 0]
## 
## FFT #1: Definition
## [1] If thal = {rd,fd}, decide True.
## [2] If cp != {a}, decide False.
## [3] If ca <= 0, decide False, otherwise, decide True.
## 
## FFT #1: Prediction Accuracy
## Prediction Data: N = 153, Pos (+) = 73 (48%) 
## 
## |         | True + | True - |
## |---------|--------|--------|
## |Decide + | hi 64  | fa 19  | 83
## |Decide - | mi 9   | cr 61  | 70
## |---------|--------|--------|
##             73       80       N = 153
## 
## acc  = 81.7%  ppv  = 77.1%  npv  = 87.1%
## bacc = 82.0%  sens = 87.7%  spec = 76.2%
## E(cost) = 0.183
## 
## FFT #1: Prediction Speed and Frugality
## mcu = 1.73, pci = 0.87

Here is a description of each statistic:

statistic  long name             definition
---------  --------------------  -----------------------------------------------------------
n          N                     Number of cases.
mcu        Mean cues used        On average, how many cues were needed to classify cases?
pci        Percent cues ignored  The percent of cue information ignored when classifying cases with a given tree; equal to 1 - mcu / cues.n, where cues.n is the total number of cues in the data.
sens       Sensitivity           The percentage of true positive cases correctly classified.
spec       Specificity           The percentage of true negative cases correctly classified.
acc        Accuracy              The percentage of cases that were correctly classified.
wacc       Weighted accuracy     The weighted average of sensitivity and specificity, where sensitivity is weighted by sens.w (by default, sens.w = .5).
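
As a concrete check, here is how the accuracy statistics in the FFT #1 printout above follow from its confusion matrix (a minimal base-R sketch, with the four counts copied from the printout):

# Reconstruct FFT #1's accuracy statistics from its 2 x 2 table
hi <- 64; fa <- 19; mi <- 9; cr <- 61   # hits, false alarms, misses, correct rejections
sens <- hi / (hi + mi)                  # sensitivity: 0.877
spec <- cr / (cr + fa)                  # specificity: 0.762
sens.w <- .5                            # the default sensitivity weight
sens.w * sens + (1 - sens.w) * spec     # wacc = 0.820, equal to bacc when sens.w = .5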

Cue accuracy statistics: cue.accuracies

Each tree has a decision threshold for each cue (regardless of whether or not that cue is actually used in the tree) that maximizes the goal value of the cue when applied to the entire training dataset. You can obtain these decision thresholds, along with each cue’s marginal accuracy statistics, from the cues element of the FFTrees object. If the object contains test data, you can also see the marginal cue accuracies in the test dataset (computed using the thresholds calculated from the training data):

# Show decision thresholds and marginal classification training accuracies for each cue
heart.fft$cues$stats$train
##         cue     class            threshold direction   n hi fa mi cr      sens
## 1       age   numeric                   57         > 150 38 22 28 62 0.5757576
## 2       sex   numeric                    0         > 150 53 48 13 36 0.8030303
## 3        cp character                    a         = 150 48 18 18 66 0.7272727
## 4  trestbps   numeric                  148         > 150 15  9 51 75 0.2272727
## 5      chol   numeric                  273         > 150 22 17 44 67 0.3333333
## 6       fbs   numeric                    0         > 150 10  9 56 75 0.1515152
## 7   restecg character hypertrophy,abnormal         = 150 40 34 26 50 0.6060606
## 8   thalach   numeric                  154        <= 150 44 29 22 55 0.6666667
## 9     exang   numeric                    0         > 150 31 14 35 70 0.4696970
## 10  oldpeak   numeric                  0.8         > 150 41 22 25 62 0.6212121
## 11    slope character            flat,down         = 150 45 27 21 57 0.6818182
## 12       ca   numeric                    0         > 150 47 19 19 65 0.7121212
## 13     thal character                rd,fd         = 150 47 16 19 68 0.7121212
##         spec       ppv       npv      bacc       acc      wacc cost_decisions
## 1  0.7380952 0.6333333 0.6888889 0.6569264 0.6666667 0.6569264     -0.3333333
## 2  0.4285714 0.5247525 0.7346939 0.6158009 0.5933333 0.6158009     -0.4066667
## 3  0.7857143 0.7272727 0.7857143 0.7564935 0.7600000 0.7564935     -0.2400000
## 4  0.8928571 0.6250000 0.5952381 0.5600649 0.6000000 0.5600649     -0.4000000
## 5  0.7976190 0.5641026 0.6036036 0.5654762 0.5933333 0.5654762     -0.4066667
## 6  0.8928571 0.5263158 0.5725191 0.5221861 0.5666667 0.5221861     -0.4333333
## 7  0.5952381 0.5405405 0.6578947 0.6006494 0.6000000 0.6006494     -0.4000000
## 8  0.6547619 0.6027397 0.7142857 0.6607143 0.6600000 0.6607143     -0.3400000
## 9  0.8333333 0.6888889 0.6666667 0.6515152 0.6733333 0.6515152     -0.3266667
## 10 0.7380952 0.6507937 0.7126437 0.6796537 0.6866667 0.6796537     -0.3133333
## 11 0.6785714 0.6250000 0.7307692 0.6801948 0.6800000 0.6801948     -0.3200000
## 12 0.7738095 0.7121212 0.7738095 0.7429654 0.7466667 0.7429654     -0.2533333
## 13 0.8095238 0.7460317 0.7816092 0.7608225 0.7666667 0.7608225     -0.2333333
##          cost cost_cues
## 1  -0.3333333         0
## 2  -0.4066667         0
## 3  -0.2400000         0
## 4  -0.4000000         0
## 5  -0.4066667         0
## 6  -0.4333333         0
## 7  -0.4000000         0
## 8  -0.3400000         0
## 9  -0.3266667         0
## 10 -0.3133333         0
## 11 -0.3200000         0
## 12 -0.2533333         0
## 13 -0.2333333         0
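
Because trees are built from the most accurate cues, it can help to sort this table by wacc. Here is one way to do so in base R (thal, cp, and ca, the cues appearing in the trees below, should come out on top):

# Rank cues by their marginal training wacc
cue.stats <- heart.fft$cues$stats$train
cue.stats[order(cue.stats$wacc, decreasing = TRUE), c("cue", "wacc")]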

You can also view the cue accuracies in an ROC plot with plot() combined with the what = "cues" argument. This will show the sensitivities and specificities for each cue, with the top 5 cues highlighted.

# Visualize individual cue accuracies
plot(heart.fft, 
     main = "Heartdisease Cue Accuracy",
     what = "cues")

Tree definitions

The trees$definitions dataframe contains the definitions (cues, classes, exits, thresholds, and directions) of all trees in the object. The combination of these five pieces of information (as well as their order) defines how a tree makes decisions.

# Print the definitions of all trees
heart.fft$trees$definitions
## # A tibble: 7 × 7
##    tree nodes classes cues             directions thresholds          exits    
##   <int> <int> <chr>   <chr>            <chr>      <chr>               <chr>    
## 1     1     3 c;c;n   thal;cp;ca       =;=;>      rd,fd;a;0           1;0;0.5  
## 2     2     4 c;c;n;c thal;cp;ca;slope =;=;>;=    rd,fd;a;0;flat,down 1;0;1;0.5
## 3     3     3 c;c;n   thal;cp;ca       =;=;>      rd,fd;a;0           0;1;0.5  
## 4     4     4 c;c;n;c thal;cp;ca;slope =;=;>;=    rd,fd;a;0;flat,down 1;1;0;0.5
## 5     5     3 c;c;n   thal;cp;ca       =;=;>      rd,fd;a;0           0;0;0.5  
## 6     6     4 c;c;n;c thal;cp;ca;slope =;=;>;=    rd,fd;a;0;flat,down 0;0;0;0.5
## 7     7     4 c;c;n;c thal;cp;ca;slope =;=;>;=    rd,fd;a;0;flat,down 1;1;1;0.5

To understand how to read these definitions, let’s start with tree #1, the tree with the highest training weighted accuracy.

The nodes within a tree definition are separated by semicolons (;). For example, tree 1 has 3 cues in the order thal, cp, ca. The classes of these cues are c (character), c (character), and n (numeric). The decision exits for the cues are 1 (positive), 0 (negative), and 0.5 (both positive and negative): the first cue only makes positive decisions, the second cue only makes negative decisions, and the final cue makes both positive and negative decisions.

The decision thresholds are rd,fd for the first cue, a for the second cue, and 0 for the third cue, while the cue directions are =, =, and >, respectively. Note that cue directions indicate how the tree would make positive decisions if it had a positive exit for that cue. If the tree has a positive exit for a given cue, then cases that satisfy its threshold and direction are classified as positive. However, if the tree has only a negative exit for a given cue, then cases that do not satisfy the threshold and direction are classified as negative.
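
Because each column packs one value per node into a single semicolon-separated string, you can unpack a definition with strsplit(). A minimal base-R sketch for tree #1:

# Unpack the semicolon-separated definition of tree #1
def <- heart.fft$trees$definitions[1, ]
strsplit(def$cues, ";")[[1]]         # "thal" "cp" "ca"
strsplit(def$exits, ";")[[1]]        # "1" "0" "0.5"
strsplit(def$thresholds, ";")[[1]]   # "rd,fd" "a" "0"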

From this, we can understand tree #1 verbally as follows:

If thal is equal to either rd or fd, predict positive. Otherwise, if cp is not equal to a, predict negative. Otherwise, if ca is greater than 0, predict positive, otherwise, predict negative.

You can use the inwords() function to automatically return a verbal description of a tree in an FFTrees object:

# Describe the best training tree
inwords(heart.fft, tree = 1)
## [1] "If thal = {rd,fd}, decide True"                  
## [2] "If cp != {a}, decide False"                      
## [3] "If ca <= 0, decide False, otherwise, decide True"

Accuracy statistics

Here are the training statistics for all seven trees:

# Print training statistics for all trees
heart.fft$trees$stats$train
## # A tibble: 7 × 19
##    tree     n    hi    fa    mi    cr  sens  spec    far   ppv   npv   acc  bacc
##   <int> <int> <int> <int> <int> <int> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1   150    54    18    12    66 0.818 0.786 0.214  0.75  0.846 0.8   0.802
## 2     2   150    57    22     9    62 0.864 0.738 0.262  0.722 0.873 0.793 0.801
## 3     3   150    44     7    22    77 0.667 0.917 0.0833 0.863 0.778 0.807 0.792
## 4     4   150    60    31     6    53 0.909 0.631 0.369  0.659 0.898 0.753 0.770
## 5     5   150    28     2    38    82 0.424 0.976 0.0238 0.933 0.683 0.733 0.700
## 6     6   150    21     1    45    83 0.318 0.988 0.0119 0.955 0.648 0.693 0.653
## 7     7   150    64    56     2    28 0.970 0.333 0.667  0.533 0.933 0.613 0.652
## # … with 6 more variables: wacc <dbl>, cost_decisions <dbl>, cost_cues <dbl>,
## #   cost <dbl>, pci <dbl>, mcu <dbl>
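
Because wacc is among these columns, you can recover which tree ranks first on training wacc, i.e., the tree printed as FFT #1 earlier, directly from this table (a one-line sketch):

# Find the tree with the highest training wacc
with(heart.fft$trees$stats$train, tree[which.max(wacc)])   # should return 1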

Decisions

The decisions list contains the raw classification decisions of each tree for each case.

Here is how tree #1 classified the cases in the training data:

# Look at the decisions made by tree #1
heart.fft$trees$decisions$train$tree_1
## # A tibble: 150 × 6
##    criterion decision levelout cost_cue cost_decision  cost
##    <lgl>     <lgl>       <int>    <dbl>         <dbl> <dbl>
##  1 FALSE     FALSE           2        0             0     0
##  2 FALSE     FALSE           2        0             0     0
##  3 FALSE     FALSE           2        0             0     0
##  4 TRUE      TRUE            1        0             0     0
##  5 FALSE     FALSE           2        0             0     0
##  6 FALSE     TRUE            1        0             1     1
##  7 FALSE     FALSE           2        0             0     0
##  8 TRUE      TRUE            1        0             0     0
##  9 TRUE      TRUE            3        0             0     0
## 10 FALSE     FALSE           2        0             0     0
## # … with 140 more rows
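
Because criterion holds the true classes and decision the tree's classifications, tree #1's training accuracy can be recomputed directly from this table (a quick sketch):

# Recompute tree #1's training accuracy from its raw decisions
dec <- heart.fft$trees$decisions$train$tree_1
mean(dec$decision == dec$criterion)   # should match acc = 0.80 for tree 1 above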

Predicting new data with predict()

Once you’ve created an FFTrees object, you can use it to predict new data using predict(). In this example, I’ll use the heart.fft object to make predictions for cases 1 through 10 in the heartdisease dataset. By default, the tree with the best training wacc value is used:

# Predict classes for new data from the best training tree
predict(heart.fft,
        newdata = heartdisease[1:10,])
##  [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE
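
To predict from a tree other than the default, predict() also accepts a tree argument (a sketch, using the tree numbering from the definitions shown earlier):

# Predict classes for new data using tree 2 instead of the default tree
predict(heart.fft,
        newdata = heartdisease[1:10, ],
        tree = 2)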

To predict class probabilities, include the type = "prob" argument. This will return a table of class probabilities, where prob_0 is the predicted probability that a case is 0 / FALSE, and prob_1 is the predicted probability that a case is 1 / TRUE:

# Predict class probabilities for new data from the best training tree
predict(heart.fft,
        newdata = heartdisease,
        type = "prob")
## # A tibble: 303 × 2
##    prob_0 prob_1
##     <dbl>  <dbl>
##  1  0.262  0.738
##  2  0.273  0.727
##  3  0.262  0.738
##  4  0.862  0.138
##  5  0.862  0.138
##  6  0.862  0.138
##  7  0.273  0.727
##  8  0.706  0.294
##  9  0.262  0.738
## 10  0.262  0.738
## # … with 293 more rows
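
For this object, the class prediction is simply the more probable class, so thresholding prob_1 at .5 reproduces the class predictions shown earlier (a quick sketch):

# Class predictions correspond to prob_1 > .5
p <- predict(heart.fft, newdata = heartdisease, type = "prob")
head(p$prob_1 > .5, 10)   # should match the 10 class predictions above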

Use type = "both" to get both classification and probability predictions for cases:

# Predict classes and probabilities
predict(heart.fft,
        newdata = heartdisease,
        type = "both")
## # A tibble: 303 × 3
##    class prob_0 prob_1
##    <lgl>  <dbl>  <dbl>
##  1 TRUE   0.262  0.738
##  2 TRUE   0.273  0.727
##  3 TRUE   0.262  0.738
##  4 FALSE  0.862  0.138
##  5 FALSE  0.862  0.138
##  6 FALSE  0.862  0.138
##  7 TRUE   0.273  0.727
##  8 FALSE  0.706  0.294
##  9 TRUE   0.262  0.738
## 10 TRUE   0.262  0.738
## # … with 293 more rows

Visualising trees

Once you’ve created an FFTrees object using FFTrees(), you can visualize the tree (and ROC curves) using plot(). The following code will visualize the best training tree applied to the test data:

plot(heart.fft,
     main = "Heart Disease",
     decision.labels = c("Healthy", "Disease"))

Define an FFT manually with my.tree

You can also define a specific FFT to apply to a dataset using the my.tree argument. To do so, specify the FFT as a sentence, making sure to spell the cue names exactly as they appear in the data. Specify sets of factor levels using curly braces. In the example below, I’ll manually define an FFT using the sentence "If chol > 300, predict True. If thal = {fd,rd}, predict False. Otherwise, predict True"

# Define a tree manually using the my.tree argument
myheart.fft <- FFTrees(diagnosis ~ .,
                       data = heartdisease, 
                       my.tree = "If chol > 300, predict True. If thal = {fd,rd}, predict False. Otherwise, predict True")

# Here is the result
plot(myheart.fft, 
     main = "Specifying an FFT manually")

As you can see, this FFT was pretty terrible.