library(NNS)
library(data.table)
require(knitr)
require(rgl)
require(meboot)
require(dtw)
Below are some examples demonstrating unsupervised learning with NNS clustering and nonlinear regression using the resulting clusters. As always, for a more thorough description and definition, please view the References.
NNS.part()
NNS.part is both a partitional and hierarchical clustering method. NNS iteratively partitions the joint distribution into partial moment quadrants, and then assigns a quadrant identification (1:4) at each partition.
NNS.part returns a data.table of observations along with their final quadrant identification. It also returns the regression points, which are the quadrant means used in NNS.reg.
x = seq(-5, 5, .05); y = x ^ 3
for(i in 1 : 4){NNS.part(x, y, order = i, Voronoi = TRUE, obs.req = 0)}
NNS.part offers a partitioning based on \(x\) values only, NNS.part(x, y, type = "XONLY", ...), using the entire bandwidth in its regression point derivation, and shares the same limit condition as partitioning via both \(x\) and \(y\) values.
for(i in 1 : 4){NNS.part(x, y, order = i, type = "XONLY", Voronoi = TRUE)}
Note the partition identifications are limited to 1's and 2's (left and right of the partition, respectively), rather than the four values produced when partitioning on both \(x\) and \(y\).
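The printed output below (final $order of 4 and "XONLY"-style quadrant labels) would come from a direct call such as the following; this call is an assumed reconstruction consistent with the output shown, not reproduced from the original chunk:

NNS.part(x, y, order = 4, type = "XONLY")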
## $order
## [1] 4
##
## $dt
## x y quadrant prior.quadrant
## 1: -5.00 -125.0000 q1111 q111
## 2: -4.95 -121.2874 q1111 q111
## 3: -4.90 -117.6490 q1111 q111
## 4: -4.85 -114.0841 q1111 q111
## 5: -4.80 -110.5920 q1111 q111
## ---
## 197: 4.80 110.5920 q2222 q222
## 198: 4.85 114.0841 q2222 q222
## 199: 4.90 117.6490 q2222 q222
## 200: 4.95 121.2874 q2222 q222
## 201: 5.00 125.0000 q2222 q222
##
## $regression.points
## quadrant x y
## 1: q111 -4.4166667 -87.0226667
## 2: q112 -3.1666667 -32.3757917
## 3: q121 -1.9166667 -7.4164167
## 4: q122 -0.6666667 -0.4257917
## 5: q211 0.5833333 0.3148333
## 6: q212 1.8333333 6.5242083
## 7: q221 3.0833333 29.9210833
## 8: q222 4.3583333 83.7120208
The right column of plots shows the corresponding regression for the order of NNS partitioning.
for(i in 1 : 3){NNS.part(x, y, order = i, obs.req = 0, Voronoi = TRUE) ; NNS.reg(x, y, order = i, ncores = 1)}
NNS.reg()
NNS.reg can fit any \(f(x)\), for both uni- and multivariate cases. NNS.reg returns a self-evident list of values provided below.
NNS.reg(x, y, ncores = 1)
## $R2
## [1] 1
##
## $SE
## [1] 0
##
## $Prediction.Accuracy
## NULL
##
## $equation
## NULL
##
## $x.star
## NULL
##
## $derivative
## Coefficient X.Lower.Range X.Upper.Range
## 1: 74.2525 -5.00 -4.95
## 2: 72.7675 -4.95 -4.90
## 3: 71.2975 -4.90 -4.85
## 4: 69.8425 -4.85 -4.80
## 5: 68.4025 -4.80 -4.75
## ---
## 196: 68.4025 4.75 4.80
## 197: 69.8425 4.80 4.85
## 198: 71.2975 4.85 4.90
## 199: 72.7675 4.90 4.95
## 200: 74.2525 4.95 5.00
##
## $Point.est
## NULL
##
## $regression.points
## x y
## 1: -5.00 -125.0000
## 2: -4.95 -121.2874
## 3: -4.90 -117.6490
## 4: -4.85 -114.0841
## 5: -4.80 -110.5920
## ---
## 197: 4.80 110.5920
## 198: 4.85 114.0841
## 199: 4.90 117.6490
## 200: 4.95 121.2874
## 201: 5.00 125.0000
##
## $Fitted.xy
## x y y.hat NNS.ID gradient residuals standard.errors
## 1: -5.00 -125.0000 -125.0000 q4444444444 74.2525 0 0
## 2: -4.95 -121.2874 -121.2874 q4444441444 72.7675 0 0
## 3: -4.90 -117.6490 -117.6490 q4444414444 71.2975 0 0
## 4: -4.85 -114.0841 -114.0841 q4444411444 69.8425 0 0
## 5: -4.80 -110.5920 -110.5920 q4444144444 68.4025 0 0
## ---
## 197: 4.80 110.5920 110.5920 q1111144144 69.8425 0 0
## 198: 4.85 114.0841 114.0841 q1111141444 71.2975 0 0
## 199: 4.90 117.6490 117.6490 q1111114444 72.7675 0 0
## 200: 4.95 121.2874 121.2874 q1111114144 74.2525 0 0
## 201: 5.00 125.0000 125.0000 q1111111444 74.2525 0 0
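As a quick sanity check (not part of the original output), the $derivative coefficients above are the secant slopes between consecutive regression points, so for \(y = x^3\) they should track the analytical derivative \(3x^2\) near each interval midpoint. A minimal sketch, assuming the fitted object and the $derivative column names shown above:

fit <- NNS.reg(x, y, ncores = 1)
mid <- (fit$derivative$X.Lower.Range + fit$derivative$X.Upper.Range) / 2   # interval midpoints
head(cbind(NNS.slope = fit$derivative$Coefficient, analytic = 3 * mid ^ 2))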
Multivariate regressions return a plot of \(y\) and \(\hat{y}\), as well as the regression points ($RPM) and partitions ($rhs.partitions) for each regressor.
f = function(x, y) x ^ 3 + 3 * y - y ^ 3 - 3 * x
y = x ; z = expand.grid(x, y)
g = f(z[ , 1], z[ , 2])
NNS.reg(z, g, order = "max", ncores = 1)
## $R2
## [1] 1
##
## $rhs.partitions
## Var1 Var2
## 1: -5.00 -5
## 2: -4.95 -5
## 3: -4.90 -5
## 4: -4.85 -5
## 5: -4.80 -5
## ---
## 40397: 4.80 5
## 40398: 4.85 5
## 40399: 4.90 5
## 40400: 4.95 5
## 40401: 5.00 5
##
## $RPM
## Var1 Var2 y.hat
## 1: -4.8 -4.80 -7.105427e-15
## 2: -4.8 -2.55 -8.726063e+01
## 3: -4.8 -2.50 -8.806700e+01
## 4: -4.8 -2.45 -8.883587e+01
## 5: -4.8 -2.40 -8.956800e+01
## ---
## 40397: -2.6 -2.80 3.776000e+00
## 40398: -2.6 -2.75 2.770875e+00
## 40399: -2.6 -2.70 1.807000e+00
## 40400: -2.6 -2.65 8.836250e-01
## 40401: -2.6 -2.60 1.776357e-15
##
## $Point.est
## NULL
##
## $Fitted.xy
## Var1 Var2 y y.hat NNS.ID residuals
## 1: -5.00 -5 0.000000 0.000000 201.201 0
## 2: -4.95 -5 3.562625 3.562625 402.201 0
## 3: -4.90 -5 7.051000 7.051000 603.201 0
## 4: -4.85 -5 10.465875 10.465875 804.201 0
## 5: -4.80 -5 13.808000 13.808000 1005.201 0
## ---
## 40397: 4.80 5 -13.808000 -13.808000 39597.40401 0
## 40398: 4.85 5 -10.465875 -10.465875 39798.40401 0
## 40399: 4.90 5 -7.051000 -7.051000 39999.40401 0
## 40400: 4.95 5 -3.562625 -3.562625 40200.40401 0
## 40401: 5.00 5 0.000000 0.000000 40401.40401 0
NNS.reg can inter- or extrapolate any point of interest. The NNS.reg(x, y, point.est = ...) parameter accepts data of any size with the same dimensions as \(x\), and the resulting estimates are called specifically with NNS.reg(...)$Point.est.
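For example, a minimal sketch extrapolating beyond the observed range of the cubic example (re-created here under fresh names, since y was redefined in the multivariate example above):

xx <- seq(-5, 5, .05); yy <- xx ^ 3   # re-create the univariate cubic data
NNS.reg(xx, yy, point.est = 6, ncores = 1)$Point.est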
NNS.reg also provides a dimension reduction regression via the parameter NNS.reg(x, y, dim.red.method = "cor", ...), reducing all regressors to a single dimension using the returned equation, NNS.reg(..., dim.red.method = "cor", ...)$equation.
NNS.reg(iris[ , 1 : 4], iris[ , 5], dim.red.method = "cor", location = "topleft", ncores = 1)$equation
## Variable Coefficient
## 1: Sepal.Length 0.7980781
## 2: Sepal.Width -0.4402896
## 3: Petal.Length 0.9354305
## 4: Petal.Width 0.9381792
## 5: DENOMINATOR 4.0000000
Thus, our model for this regression would be: \[Species = \frac{0.798*Sepal.Length -0.44*Sepal.Width +0.935*Petal.Length +0.938*Petal.Width}{4} \]
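The single synthetic regressor implied by this equation can also be constructed by hand; a minimal sketch, assuming the $equation output above (Variable/Coefficient columns, with the last row holding the DENOMINATOR):

eq <- NNS.reg(iris[ , 1 : 4], iris[ , 5], dim.red.method = "cor", ncores = 1)$equation
synthetic.x <- as.matrix(iris[ , 1 : 4]) %*% eq$Coefficient[1 : 4] / eq$Coefficient[5]   # weighted average of regressors
head(synthetic.x)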
NNS.reg(x, y, dim.red.method = "cor", threshold = ...) offers a method of reducing regressors further by controlling the absolute value of required correlation.
NNS.reg(iris[ , 1 : 4], iris[ , 5], dim.red.method = "cor", threshold = .75, location = "topleft", ncores = 1)$equation
## Variable Coefficient
## 1: Sepal.Length 0.7980781
## 2: Sepal.Width 0.0000000
## 3: Petal.Length 0.9354305
## 4: Petal.Width 0.9381792
## 5: DENOMINATOR 3.0000000
Thus, our model for this further reduced dimension regression would be: \[Species = \frac{\: 0.798*Sepal.Length + 0*Sepal.Width +0.935*Petal.Length +0.938*Petal.Width}{3} \]
The point.est = (...) argument operates in the same manner as in the full regression above, and is again called with NNS.reg(...)$Point.est.
NNS.reg(iris[ , 1 : 4], iris[ , 5], dim.red.method = "cor", threshold = .75, point.est = iris[1 : 10, 1 : 4], location = "topleft", ncores = 1)$Point.est
## [1] 1 1 1 1 1 1 1 1 1 1
For a classification problem, we simply set NNS.reg(x, y, type = "CLASS", ...).
NOTE: The base category of the response variable should be 1, not 0, for classification problems.
NNS.reg(iris[ , 1 : 4], iris[ , 5], type = "CLASS", point.est = iris[1 : 10, 1 : 4], location = "topleft", ncores = 1)$Point.est
## [1] 1 1 1 1 1 1 1 1 1 1
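To illustrate the base-category note above, a 0-based binary response would simply be shifted before fitting; a hypothetical sketch (y01 is a placeholder response, not from the original example):

set.seed(1)
y01 <- sample(0 : 1, 150, replace = TRUE)   # hypothetical 0/1 response, shifted to 1/2 below
NNS.reg(iris[ , 1 : 4], y01 + 1, type = "CLASS",
        point.est = iris[1 : 10, 1 : 4], ncores = 1)$Point.est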
NNS.stack()
The NNS.stack routine cross-validates, for a given objective function, the n.best parameter in the multivariate NNS.reg function as well as the threshold parameter in the dimension reduction NNS.reg version. NNS.stack can be used for classification, NNS.stack(..., type = "CLASS", ...), or for continuous dependent variables, NNS.stack(..., type = NULL, ...).
Any objective function obj.fn can be called using expression() with the terms predicted and actual.
NNS.stack(IVs.train = iris[ , 1 : 4],
DV.train = iris[ , 5],
IVs.test = iris[1 : 10, 1 : 4],
obj.fn = expression( mean(round(predicted) == actual) ),
objective = "max", type = "CLASS",
folds = 1, ncores = 1)
## $OBJfn.reg
## [1] 0.9733333
##
## $NNS.reg.n.best
## [1] 1
##
## $probability.threshold
## [1] 0.5
##
## $OBJfn.dim.red
## [1] 0.96
##
## $NNS.dim.red.threshold
## [1] 0.78
##
## $reg
## [1] 1 1 1 1 1 1 1 1 1 1
##
## $dim.red
## [1] 1 1 1 1 1 1 1 1 1 1
##
## $stack
## [1] 1 1 1 1 1 1 1 1 1 1
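For a continuous dependent variable, the same routine applies with type = NULL and a suitable objective; a minimal sketch, assuming a mean squared error objective to be minimized (mtcars used purely as illustrative data):

NNS.stack(IVs.train = mtcars[ , -1],
          DV.train = mtcars[ , 1],
          obj.fn = expression( mean((predicted - actual) ^ 2) ),
          objective = "min", type = NULL,
          folds = 1, ncores = 1)$OBJfn.reg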
Given that multicollinearity is not an issue for nonparametric regressions as it is for OLS, in the case of an ill-fit univariate model a better option may be to increase the dimensionality of the regressor with a copy of itself and cross-validate the number of clusters n.best via NNS.stack(IVs.train = cbind(x, x), DV.train = y, method = 1, ...).
set.seed(123)
x <- rnorm(100); y <- rnorm(100)

nns.params <- NNS.stack(IVs.train = cbind(x, x),
                        DV.train = y,
                        method = 1, ncores = 1)
NNS.reg(cbind(x, x), y,
n.best = nns.params$NNS.reg.n.best,
point.est = cbind(x, x), ncores = 1)
If the user is so motivated, detailed arguments and further examples are provided within the following: