Detecting Skin Diseases

Mads Lindskou

Dermatitis

The UCI data set (Ilter and Altay Güvenir 1998) consists of 34 attributes (clinical attributes, histopathological attributes and the patient's age) and a class variable with six different skin diseases.

Many of the classical machine learning algorithms have been applied to the dataset (Liu et al. 2015). They all achieve a prediction accuracy above \(95\%\) and some even above \(99\%\). But…:

Goal

Given a new patient \(y\), we want to test the hypotheses

\[\begin{align*} H_1: & y \text{ has psoriasis} \\ H_2: & y \text{ has seborrheic dermatitis} \\ H_3: & y \text{ has lichen planus} \\ H_4: & y \text{ has pityriasis rosea} \\ H_5: & y \text{ has chronic dermatitis} \\ H_6: & y \text{ has pityriasis rubra pilaris} \end{align*}\]

Since the hypotheses are mutually exclusive, we do not correct for multiple hypothesis testing (the user can do so by setting the significance level accordingly).
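As a minimal sketch of such a correction, a Bonferroni adjustment (one common choice, not something the text prescribes) divides the overall significance level by the number of hypotheses. The variable names below are illustrative. Passing the adjusted level on to the fitting functions assumes they accept a significance-level argument, which should be checked against the package documentation.

```r
# Bonferroni adjustment for the six hypotheses H_1, ..., H_6.
# All names are illustrative; nothing here calls the molic package.
alpha_overall <- 0.05  # desired overall significance level
n_hypotheses  <- 6     # one hypothesis per skin disease
alpha_single  <- alpha_overall / n_hypotheses  # level for each individual test
alpha_single
```

Each individual test would then be carried out at level `alpha_single` instead of `alpha_overall`.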

Modeling Attributes of Psoriasis

We first show how to test \(H_1\). First extract the psoriasis data:

library(dplyr)
library(molic)
y     <- unlist(derma[80, -35]) # a patient with seborrheic dermatitis
psor  <- derma %>%
  filter(ES == "psoriasis") %>%
  dplyr::select(-ES)

Next, we fit the interaction graph for the psoriasis patients:

library(ess)
g <- fit_graph(psor, q = 0, trace = FALSE)

We can color the nodes corresponding to clinical attributes (red), histopathological attributes (green) and the age variable (gray):

vs   <- names(adj_lst(g))
vcol <- structure(vector("character", length(vs)), names = vs)
vcol[grepl("c", vs)] <- "tomato"  # clinical attributes
vcol[grepl("h", vs)] <- "#98FB98" # histopathological attributes
vcol["age"]          <- "gray"    # age variable
plot(g, vcol, vertex.size = 10, vertex.label = NA)

The take-home message is that we cannot assume independence between the attributes of the psoriasis patients: the interaction graph shows many associations.

Outlier Model for Psoriasis Patients

set.seed(300718)
m <- fit_outlier(psor, g, y)
print(m)

 -------------------------------- 
  Simulations: 10000 
  Variables: 34 
  Observations: 112 
  Estimated mean: 42.74 
  Estimated variance: 37.73 
 --------------------------------
  Critical value: 53.70377 
  Deviance: 50.55186 
  P-value: 0.1054 
  Alpha: 0.05 
  <novelty, outlier_model, list> 
 --------------------------------

Notice that the number of observations is \(112\) even though we have only observed \(111\) psoriasis patients. This is because, under the hypothesis \(H_1\), the new observation \(y\) has psoriasis and is therefore included. The other summary statistics are self-explanatory.

Plotting the Approximated Density of the Test Statistic

plot(m)

The red area is the critical region (here 5%) and the dotted line is the observed test statistic (the deviance) of \(y\). Since the dotted line is outside the critical region (the deviance, 50.55, is below the critical value, 53.70), we cannot reject that \(y\) has psoriasis.
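The decision rule behind the plot can be sketched in base R. The simulated deviances are not available here, so the sketch draws a synthetic stand-in from a gamma distribution matched to the estimated mean and variance in the summary above. That distribution, and the variable names, are assumptions made for illustration only and are not the simulation the package performs. The p-value is computed as the share of simulated values at least as large as the observed deviance, which is one common convention.

```r
# Synthetic stand-in for the simulated deviances (an assumption, not molic's output):
# a gamma distribution with the mean and variance reported in the summary.
set.seed(1)
sim_mean <- 42.74
sim_var  <- 37.73
sims <- rgamma(10000, shape = sim_mean^2 / sim_var, rate = sim_mean / sim_var)

alpha    <- 0.05
observed <- 50.55186  # deviance of y, taken from the summary

# The critical region starts at the (1 - alpha) quantile of the simulated values.
critical_value <- unname(quantile(sims, 1 - alpha))
# Share of simulated deviances at least as extreme as the observed one.
p_value <- mean(sims >= observed)
# Reject the hypothesis only if the observed deviance falls in the critical region.
reject <- observed > critical_value

c(critical_value = critical_value, p_value = p_value)
reject
```

Because the stand-in distribution only approximates the simulated one, the resulting numbers will differ somewhat from those in the summary.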

Testing All Hypotheses Simultaneously

We can use the fit_multiple_models function to test all six hypotheses as follows.

set.seed(300718)
mm <- fit_multiple_models(derma, y, "ES", q = 0, trace = FALSE)
plot(mm)

Thus, we cannot reject that \(y\) has psoriasis, seborrheic dermatitis or pityriasis rosea. This is conservative compared to classification methods and hence a little safer. The medical expert should continue the investigation from here.

References

Ilter, N., and H. Altay Güvenir. 1998. “UCI Machine Learning Repository.” Department of Computer Engineering and Information Science. https://archive.ics.uci.edu/ml/datasets/Dermatology.

Liu, Tong, Liang Hu, Chao Ma, Zhi-Yan Wang, and Hui-Ling Chen. 2015. “A Fast Approach for Detection of Erythemato-Squamous Diseases Based on Extreme Learning Machine with Maximum Relevance Minimum Redundancy Feature Selection.” International Journal of Systems Science 46 (5): 919–31.