Building Custom CPOs (No Output)

Martin Binder

2022-07-20

CPO Vignette Navigation

  1. First Steps (compact version)
  2. mlrCPO Core (compact version)
  3. CPOs Built Into mlrCPO (compact version)
  4. Building Custom CPOs (compact version)

Intro

The CPOs built into mlrCPO can be used for many different purposes, and can be combined to form even more powerful transformation operations. However, in some cases, it may be necessary to define new “custom” CPOs that perform a certain task; either because a preprocessing method is not (yet) defined as a builtin CPO, or because some operation very specific to the task at hand needs to be performed.

For this purpose, mlrCPO offers a very powerful interface for the creation of new CPOs. The functions and methods described here are also the ones used internally to create mlrCPO’s builtin CPOs. The files in the mlrCPO source tree whose names start with “CPO_” therefore serve as further examples of CPO definitions.

There are three types of CPO: “Feature Operation CPOs” (FOCPOs), which are only allowed to change feature columns of incoming data and which are the most common CPOs; “Target Operation CPOs” (TOCPOs), which change only target columns; and “Retrafoless CPOs” (ROCPOs), which may add rows to or delete rows from a data set, but only during training. Conceptually, ROCPOs are the simplest CPOs, followed by FOCPOs and the even more complicated TOCPOs. The commonalities of all CPO defining functions will be described first, followed by the different CPO types in order of growing complexity.

Making a CPO

To create a CPOConstructor that can then be used to create a CPO, a makeCPO*() function needs to be called. There are five functions of this kind, differing by what kind of CPO they create and how much flexibility (at the cost of simplicity) they offer the user:

CPO type  makeCPO*() functions
FOCPO     makeCPO(), makeCPOExtendedTrafo()
TOCPO     makeCPOTargetOp(), makeCPOExtendedTargetOp()
ROCPO     makeCPORetrafoless()

Each of these functions takes a “name” for the new CPO, settings for the parameter set to be used, settings for the format in which the data is supposed to be provided, data property settings, the packages to load, CPO type specific settings, and finally the transformation functions.

CPO name

Each CPO has a “name” that is used for representation when printing, and as the default prefix for hyperparameters. cpoPca, for example, has the name “pca”:

!cpoPca()

The name is set using the cpo.name parameter of the make*() functions.

CPO parameters

The ParamSet used by the CPO is given as the second parameter, par.set. It must be constructed either using makeParamSet() from the ParamHelpers package, or using the pSS() function for a more concise ParamSet definition. The given parameters will then be the function parameters of the CPOConstructor, and will by default be exported as hyperparameters (prefixed with the cpo.name).

It is possible to use the default parameter values of the par.set as defaults, or to give a par.vals list of default values. If par.vals is given, the defaults within par.set are completely ignored. Parameters that have a default value are set to this value upon construction if no value is given by the user.

Not all available parameters of a CPO need to be exported as hyperparameters. Which parameters are exported can be set during CPO construction, but the default exported parameters can be set using export.params. This can either be a character vector of the names of parameters to export, or TRUE (default, export all) or FALSE (no export).
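The parameter handling described above can be sketched as follows. This is a minimal illustration, not taken from the mlrCPO sources: the CPO name "xmpparams", the constructor xmpParams, and the parameters fraction and replace are hypothetical names chosen for the example. It defines the same two parameters once with makeParamSet() and once with pSS(), and exports only one of them as a hyperparameter.

```r
library("mlr")
library("mlrCPO")

# verbose definition using ParamHelpers
ps.long = makeParamSet(
  makeNumericParam("fraction", lower = 0, upper = 1, default = 0.5),
  makeLogicalParam("replace", default = FALSE))

# concise definition of the same parameters using pSS()
ps.short = pSS(fraction = 0.5: numeric[0, 1], replace = FALSE: logical)

# hypothetical ROCPO: 'replace' is a constructor argument,
# but only 'fraction' is exported as a hyperparameter
xmpParams = makeCPORetrafoless("xmpparams", ps.short,  # nolint
  dataformat = "df.all",
  export.params = "fraction",
  cpo.trafo = function(data, target, fraction, replace) {
    newsize = round(nrow(data) * fraction)
    data[sample(nrow(data), newsize, replace = replace), ]
  })

cpo = xmpParams(fraction = 0.2)
getParamSet(cpo)   # exported parameters, prefixed with the CPO name
getHyperPars(cpo)
```

Since "replace" has a default value and is not exported, it keeps the value given at construction and cannot be changed through setHyperPars() later.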

Data Format

Different CPO operations may want to operate on the data in different forms: as a Task, as a data.frame with or without the target column, etc. The CPO framework can perform some conversion of data to fit different needs, which is set up by the value of the dataformat parameter, together with dataformat.factor.with.ordered. While dataformat has slightly different effects on different CPO types, typically its values and effects are:

dataformat value: Effect
"task": Data is given as a Task; if the data to be transformed is a data.frame, it is converted to a cluster task before handing it to the transformation functions.
"df.all": Data is given as a data.frame, with the target column included.
"df.features": Data is given as a data.frame, the target is given as a separate data.frame.
"split": Data is given as a named list with slots $numeric, $factor, $ordered, $other, each of which contains a data.frame with the columns of the respective type. If dataformat.factor.with.ordered is TRUE, the $ordered slot is not present, and ordered features are instead given to $factor as well. Features that are not any of these types are given to $other. The target is given as a separate data.frame.
"factor", "ordered", "numeric": Only the data from columns of the named type are given to the transformation functions as a data.frame. The target columns are given as a separate data.frame.

Another parameter influencing the data format is the fix.factors flag which controls whether factor levels of prediction data need to be set to be the same as during training. If it is TRUE, previously unseen factor levels are set to NA during prediction.
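To see what a transformation function receives for a given dataformat, a CPO that only reports its input can be helpful. The following is a minimal sketch under the assumption that returning the "split" list unchanged is accepted as a valid result; the names xmpShowSplit and "xmpshowsplit" are hypothetical.

```r
library("mlr")
library("mlrCPO")

# hypothetical stateless CPO that prints the column names it receives
xmpShowSplit = makeCPO("xmpshowsplit",  # nolint
  dataformat = "split",
  dataformat.factor.with.ordered = FALSE,
  cpo.train = NULL,
  cpo.retrafo = function(data) {
    # 'data' is a named list of data.frames, one per column type
    print(lapply(data, names))
    data  # hand the list back unchanged
  })

res = iris %>>% xmpShowSplit()
```

For the iris data.frame, the numeric slot should list the four measurement columns and the factor slot the Species column, since a plain data.frame has no target column.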

Properties

mlr and mlrCPO make it possible to specify what kind of data a CPO or a Learner can handle. However, since CPOs may change data to be more or less fitting for a certain Learner, a CPO must announce not only what data it can handle, but also how it changes the capabilities of the machine learning pipeline in which it is involved. During construction, four parameters related to properties can be given.

The properties.data parameter defines what properties of feature data the CPO can handle; it must be a subset of "numerics", "factors", "ordered", and "missings". Typically, only the "missings" part is interesting since CPOs that only handle a subset of types will usually just ignore columns of other types.

The properties.target parameter defines what Task properties related to the task type and the target column a CPO can handle. It is a subset of "cluster", "classif", "multilabel", "regr", "surv" (so far defining the task type a CPO can handle), "oneclass", "twoclass", "multiclass" (properties specific to classif Tasks). Most FOCPOs do not care about the task type, while TOCPOs may only support a single task type.

properties.adding lists the properties that a CPO adds to the capabilities of a machine learning pipeline when it is executed before it, while properties.needed lists the properties needed from the following pipeline. cpoDummyEncode, for example, a CPO that converts factors and ordereds to numerics, has properties.adding == c("factors", "ordered") and properties.needed == "numerics". The many imputation CPOs have properties.adding == "missings". Usually these are only a subset of the possible properties.data states, but for TOCPOs this may also be any of "oneclass", "twoclass", "multiclass". Note that neither properties.adding nor properties.needed may be any task type, even for TOCPOs that perform task conversion.

Property Checking and .sometimes Properties

The CPO framework will check that a CPO only adds and removes the kind of data properties that it declared in properties.adding and properties.needed. It will also check that composition of CPOs, and attachment of CPOs to Learners, work out. Sometimes, however, it is necessary to treat a CPO like it does a certain manipulation (removing missings, for example) in some cases, while not in others. A CPO that only imputes missings in numeric columns should be treated as properties.adding == "missings" when it is attached to a Learner, and the Learner should gain the "missings" property. However, when data that has missings in its factorial columns is given to this CPO, the CPO framework will complain that the CPO that declared "missings" in properties.adding returned data that still had missing values in it. The solution to this dilemma is to suffix some properties with “.sometimes” when declaring them in properties.adding and properties.needed. When composing CPOs, and when checking data returned by a CPO, the framework will then be as lenient as possible. In the given example, properties.adding == "missings" will be assumed when attaching the CPO to a Learner, while properties.adding == character(0) is assumed when checking the CPO’s output (and missing values that were not imputed are therefore forgiven).
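The numeric-only imputation case described above could look roughly like the following sketch. It is an illustration under assumptions, not a builtin CPO: the names xmpImputeNum and "xmpimputenum" are hypothetical, and mean imputation is chosen only for brevity.

```r
library("mlr")
library("mlrCPO")

# hypothetical CPO that imputes missing values in numeric columns only
xmpImputeNum = makeCPO("xmpimputenum",  # nolint
  dataformat = "numeric",
  properties.adding = "missings.sometimes",
  cpo.train = function(data, target) {
    # the column means of the training data form the control object
    sapply(data, mean, na.rm = TRUE)
  },
  cpo.retrafo = function(data, control) {
    for (col in names(data)) {
      data[[col]][is.na(data[[col]])] = control[[col]]
    }
    data
  })

# missing values in the factor column 'b' are not imputed
df = data.frame(a = c(1, NA, 3), b = factor(c("x", NA, "y")))
res = df %>>% xmpImputeNum()

# attached to a Learner, the CPO is treated as adding "missings"
lrn = xmpImputeNum() %>>% makeLearner("classif.logreg")
getLearnerProperties(lrn)
```

With properties.adding = "missings" (without the suffix), the first call would be expected to fail, because the returned data still contains a missing value in column b.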

Packages

The single packages parameter can be set to a character vector listing packages necessary for a CPO to work. This is mostly useful when a CPO should be defined as part of a package or script to be distributed. The listed packages will not be attached automatically; they will only be loaded. This means that a function exported by a package still needs to be called using ::. The benefit of declaring a package in packages is that it will be loaded upon construction of a CPO, which means that a user will get immediate feedback about whether the CPO can be used or needs more packages to be installed.

Transformation Functions

The different types of CPO, and the different make*() functions, need different transformation functions to be defined. The principle behind these functions is always the same, however: The CPO framework takes input data, transforms it according to dataformat, checks it according to properties.data and properties.target, and then gives it to one or more user-given transformation functions. The transformation function must then usually create a control object containing information about the data to be used later, or transform the incoming data and return the transformation result (or both). The CPO framework then checks the transformed data according to properties.adding and properties.needed and gives it back to the CPO user.

Transformation functions are given to parameters whose names start with “cpo.”. They can either be given as functions, or as “headless” functions missing the function(...) part. In the latter case, the headless function must be a succession of expressions enclosed in curly braces ({, }) and the necessary function head is added by the CPO framework. The functions often take a subset of data, target, control, or control.invert parameters, in addition to all parameters as given in par.set.

Functional Transformation

The communication between transformation functions, e.g. giving the PCA matrix to its retrafo function, usually happens via “control” objects created by these functions and then given as parameter to other functions. In some cases, however, it may be more elegant to create a new function (e.g. a cpo.retrafo function) within another function as a “closure” (in the general, not R specific, sense) with access to all the outer function’s variables. The CPO framework makes this possible by allowing a function to be given instead of a “control” object. The function which would usually receive this control object must then be given as NULL in the makeCPO*() call.

Retrafoless CPOs

Retrafoless CPOs, or ROCPOs, are conceptually the simplest CPO type, since they do not create CPOTrained objects and therefore only need one transformation function: cpo.trafo. The value of the dataformat parameter may only be either "df.all" or "task", resulting in either a data.frame (consisting of all columns, including the target column) or a Task being given to the cpo.trafo function. cpo.trafo should have the parameters data (receiving the data as either a Task or data.frame), target (receiving the names of target columns in the data), and any parameter as given to par.set. The return value of cpo.trafo must be the transformed data, in the same format (data.frame or Task) as given as input.

Since a ROCPO only transforms incoming data during training, it should not do any transformation of target or feature values that would make it necessary to repeat this action during prediction. It may, for example, be used for subsampling a classification task to balance target classes, but it should not change the levels or values of given data rows.

The following is an example of a simplified version of the cpoSample CPO, which takes one parameter fraction and then subsamples that fraction of the incoming data without replacement:

xmpSample = makeCPORetrafoless("exsample",  # nolint
  pSS(fraction: numeric[0, 1]),
  dataformat = "df.all",
  cpo.trafo = function(data, target, fraction) {
    newsize = round(nrow(data) * fraction)
    row.indices = sample(nrow(data), newsize)
    data[row.indices, ]
  })

cpo = xmpSample(0.01)
iris %>>% cpo

It is possible to give the cpo.trafo as headless transformation function by just leaving out the function header. This can save a lot of boilerplate code when there are many parameters present, or when many transformation functions need to be given. The resulting CPO is completely equivalent to the one given above.

xmpSampleHeadless = makeCPORetrafoless("exsample",  # nolint
  pSS(fraction: numeric[0, 1]),
  dataformat = "df.all",
  cpo.trafo = {
    newsize = round(nrow(data) * fraction)
    row.indices = sample(nrow(data), newsize)
    data[row.indices, ]
  })
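The class balancing use case mentioned earlier in this section can be sketched in the same way. The following is a minimal illustration that downsamples every class to the size of the smallest one; the names xmpBalance and "xmpbalance" are hypothetical, and the approach is not necessarily how a builtin CPO would do it.

```r
library("mlr")
library("mlrCPO")

# hypothetical ROCPO that downsamples all classes to the smallest class size
xmpBalance = makeCPORetrafoless("xmpbalance",  # nolint
  dataformat = "df.all",
  properties.target = c("classif", "twoclass", "multiclass"),
  cpo.trafo = function(data, target) {
    # 'target' is the name of the target column
    classes = droplevels(data[[target]])
    n.min = min(table(classes))
    # draw n.min row indices from each class
    keep = unlist(lapply(split(seq_len(nrow(data)), classes),
      function(idx) idx[sample.int(length(idx), n.min)]))
    data[sort(keep), ]  # rows are dropped, but never modified
  })

table(getTaskTargets(pid.task))
balanced = pid.task %>>% xmpBalance()
table(getTaskTargets(balanced))
```

Only rows are removed here; feature and target values are left untouched, so nothing needs to be repeated during prediction.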

Feature Operation CPOs

FOCPOs are created with either the makeCPO() function, or the makeCPOExtendedTrafo() function. The former conceptually separates training from transformation, the latter separates transformation of training data from transformation of prediction data.

makeCPO()

In principle, a FOCPO needs a function that “trains” a control object depending on the data (cpo.train), and another function that uses this control object, and new data, to perform the preprocessing operation (cpo.retrafo). The cpo.train function must return a “control” object which contains all information about how to transform a given dataset. cpo.retrafo takes a (potentially new!) dataset and the “control” object returned by cpo.train, and transforms the new data according to plan.

In contrast to makeCPORetrafoless(), the dataformat parameter of makeCPO() can take all values described in the section Data Format. The cpo.train function takes the arguments data, target, and any other parameter described in par.set. The data value is the incoming data as a Task, a data.frame with or without the target column, or a list of data.frames of different column types, according to dataformat. The target value is a character vector of target names if dataformat is "task" or "df.all", or a data.frame of the target columns otherwise.

The cpo.train function’s return value is treated as a control object and given to the cpo.retrafo function. Its parameters are data, control, and any parameters in par.set. The format of the data given to the data parameter is according to dataformat, with the exception that if dataformat is either "task" or "df.all", it will be treated here as if its value were "df.features". This is because the cpo.retrafo function is sometimes called with prediction data which does not have any target column at all.

The following is a simplified definition of a CPO that removes the numeric columns of smallest variance, returning a dataset of only n.col numeric columns. The dataformat parameter is set to "numeric", so that only numeric columns are given to the CPO’s transformation functions; factorial columns are ignored. In cpo.train, the CPO calculates the variance of each of the data’s columns, and in cpo.retrafo it subsets the data according to these variances. Since cpo.retrafo may also be called during prediction with new data, the variance must not be calculated in cpo.retrafo; this could lead to different columns being kept during prediction than during training. This example also prints out which of its functions are being called.

xmpFilterVar = makeCPO("exemplvar",  # nolint
  pSS(n.col: integer[0, ]),
  dataformat = "numeric",
  cpo.train = function(data, target, n.col) {
    cat("*** cpo.train ***\n")
    sapply(data, var, na.rm = TRUE)
  },
  cpo.retrafo = function(data, control, n.col) {
    cat("*** cpo.retrafo ***\n")
    cat("Control:\n")
    print(control)
    cat("\n")
    greatest = order(-control)  # columns, ordered greatest to smallest var
    data[greatest[seq_len(n.col)]]
  })

cpo = xmpFilterVar(2)

(Note that the function heads are optional.)
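For comparison, a headless variant of the same CPO might look like the following sketch. The cat() and print() calls are left out for brevity, and the names xmpFilterVarHeadless and "exemplvar.headless" are hypothetical.

```r
library("mlr")
library("mlrCPO")

# headless variant: the framework adds the function heads, so 'data',
# 'target', 'control' and 'n.col' are available inside the braces
xmpFilterVarHeadless = makeCPO("exemplvar.headless",  # nolint
  pSS(n.col: integer[0, ]),
  dataformat = "numeric",
  cpo.train = {
    sapply(data, var, na.rm = TRUE)
  },
  cpo.retrafo = {
    greatest = order(-control)  # columns, ordered greatest to smallest var
    data[greatest[seq_len(n.col)]]
  })

res = head(iris) %>>% xmpFilterVarHeadless(2)
names(res)
```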

When the CPO is called with a dataset, the cpo.train function is called first, creating the control object which is then given to cpo.retrafo.

(trafd = head(iris) %>>% cpo)

Note that the two columns of the entire iris dataset with the greatest variance are Petal.Length and Sepal.Length:

head(iris %>>% cpo)

However, when applying the retrafo() of trafd to the entire dataset, the same columns are kept as in the first transformation: Sepal.Width and Sepal.Length. When the retrafo() is used, cpo.train is not called; instead, the control object saved inside the retrafo is used.

head(iris %>>% retrafo(trafd))

It is also possible to inspect the CPOTrained object to see that the control is there:

getCPOTrainedState(retrafo(trafd))

Functional FOCPO

Instead of returning the control object, cpo.train may also return the cpo.retrafo function. This may be more succinct to write if there are many little pieces of information from the cpo.train run that the cpo.retrafo function should have access to.

When cpo.retrafo is given functionally, it should be a function with only one parameter: the newly incoming data. It can access the values of the par.set parameters from its encapsulating environment in cpo.train.

Note that the data and target values given to cpo.train are deleted after the cpo.train call, so cpo.retrafo does not have access to them. In fact, the CPO framework will give a warning about this.

xmpFilterVarFunc = makeCPO("exemplvar.func",  # nolint
  pSS(n.col: integer[0, ]),
  dataformat = "numeric",
  cpo.retrafo = NULL,
  cpo.train = function(data, target, n.col) {
    cat("*** cpo.train ***\n")
    ctrl = sapply(data, var, na.rm = TRUE)
    function(x) {  # the data is given to the only present parameter: 'x'
      cat("*** cpo.retrafo ***\n")
      cat("Control:\n")
      print(ctrl)
      cat("\ndata:\n")
      print(data)  # 'data' is deleted: NULL
      cat("target:\n")
      print(target)  # 'target' is deleted: NULL
      greatest = order(-ctrl)  # columns, ordered greatest to smallest var
      x[greatest[seq_len(n.col)]]
    }
  })

cpo = xmpFilterVarFunc(2)

(Note that the function heads are optional.)

(trafd = head(iris) %>>% cpo)

The CPOTrained state for a functional CPO is the environment of the retrafo function. It contains the “ctrl” variable defined during training, the parameters given to cpo.train, and the cpo.retrafo function itself. Note that data and target are deleted and replaced by different values.

getCPOTrainedState(retrafo(trafd))

Stateless FOCPO

“Stateless” CPOs are CPOs that perform the same action during transformation of training and prediction data, independent of any information gathered during training. An example would be a CPO that converts all its columns to numeric columns. When a FOCPO does not need a state, the cpo.train parameter of makeCPO() can be set to NULL. The cpo.retrafo function then has no control parameter; it only has a data parameter and any par.set parameters. The as.numeric-CPO could be written as the following:

xmpAsNum = makeCPO("asnum",  # nolint
  cpo.train = NULL,
  cpo.retrafo = function(data) {
    data.frame(lapply(data, as.numeric))
  })

cpo = xmpAsNum()

(Note that the function head is optional.)

(trafd = head(iris) %>>% cpo)

The “state” of the CPOTrained object thus created only contains information about the incoming data shape, to make sure that the CPOTrained object is only used on conforming data (as doing otherwise would indicate a bug).

getCPOTrainedState(retrafo(trafd))

makeCPOExtendedTrafo()

Sometimes it is advantageous to have the training operation return the transformed data right away. PCA, for example, returns the rotation matrix and the transformed data; it would be a waste of time to only return the rotation matrix in a cpo.train function and apply it on the training data in cpo.retrafo. The makeCPOExtendedTrafo() function works very much like makeCPO(), with the difference that it has a cpo.trafo instead of a cpo.train function parameter. The cpo.trafo takes the same parameters as cpo.train, but returns the transformed data instead of a control object. The control object needs to be created additionally, as a variable by the cpo.trafo function. The CPO framework takes the value of a variable named control inside the cpo.trafo function and gives it to the cpo.retrafo function.

The following is a simplified version of the cpoPca CPO, which does not scale or center the data.

xmpPca = makeCPOExtendedTrafo("simple.pca",  # nolint
  pSS(n.col: integer[0, ]),
  dataformat = "numeric",
  cpo.trafo = function(data, target, n.col) {
    cat("*** cpo.trafo ***\n")
    pcr = prcomp(as.matrix(data), center = FALSE, scale. = FALSE, rank = n.col)
    # save the rotation matrix as 'control' variable
    control = pcr$rotation
    pcr$x
  },
  cpo.retrafo = function(data, control, n.col) {
    cat("*** cpo.retrafo ***\n")
    # rotate the data by the rotation matrix
    as.matrix(data) %*% control
  })

cpo = xmpPca(2)

When this CPO is applied to data, only the cpo.trafo function is called.

(trafd = head(iris) %>>% cpo)

When the retrafo CPOTrained is used, the cpo.retrafo function is called, making use of the rotation matrix.

tail(iris) %>>% retrafo(trafd)

The rotation matrix can be inspected using getCPOTrainedState.

getCPOTrainedState(retrafo(trafd))

Functional FOCPO

As with makeCPO(), makeCPOExtendedTrafo() makes it possible to define functional CPOs. Since cpo.trafo already returns the transformed data, the retrafo function cannot be its return value; instead, cpo.trafo must define it as a variable named cpo.retrafo, which takes the place of the “control” variable. Like in makeCPO(), the cpo.retrafo parameter of makeCPOExtendedTrafo() must then be NULL. The PCA example above could thus also be written as

xmpPcaFunc = makeCPOExtendedTrafo("simple.pca.func",  # nolint
  pSS(n.col: integer[0, ]),
  dataformat = "numeric",
  cpo.retrafo = NULL,
  cpo.trafo = function(data, target, n.col) {
    cat("*** cpo.trafo ***\n")
    pcr = prcomp(as.matrix(data), center = FALSE, scale. = FALSE, rank = n.col)
    # define the retrafo function as the 'cpo.retrafo' variable
    cpo.retrafo = function(data) {
      cat("*** cpo.retrafo ***\n")
      # rotate the data by the rotation matrix
      as.matrix(data) %*% pcr$rotation
    }
    pcr$x
  })

cpo = xmpPcaFunc(2)
(trafd = head(iris) %>>% cpo)

This also serves as an example of the disadvantages of a functional CPO: Since the CPO state contains all the information contained in the cpo.trafo call (except the data and target variables), it may take up more memory than needed. For this CPO, the state contains the pcr variable which contains the transformed training data in its $x slot. If the training data is a very large dataset, this would result in CPO states that take up a lot of working memory.

getCPOTrainedState(retrafo(trafd))$pcr$x

Target Operation CPOs

TOCPOs are more complicated than FOCPOs, since they potentially need to operate on data at three different points: During initial training, during the re-transformation for new prediction data, and during the inversion of predictions made by a model trained on transformed data. Similarly to makeCPO(), makeCPOTargetOp() splits these operations up into functions that create “control” objects, and functions that do the actual transformation. makeCPOExtendedTargetOp(), on the other hand, gives the user more flexibility at the price of the user having to make sure that transformation and retransformation perform the same operation–similarly to makeCPOExtendedTrafo() for FOCPOs.

Task Type and Conversion

In contrast to FOCPOs, TOCPOs can only operate on one type of Task. Therefore, the properties.target parameter of makeCPO*TargetOp() must contain exactly one Task type ("cluster", "classif", "regr", "surv", "multilabel") and possibly some more task properties (currently only "oneclass", "twoclass", "multiclass" if the Task type is "classif").

It is possible to write TOCPOs that perform conversion of Task types. For that, the task.type.out parameter must be set to the Task type that the CPO converts the data to. If conversion happens, the transformation functions need to return target data fit for the task.type.out Task type.

properties.adding and properties.needed should not be any Task type, even when conversion happens. Only if one of the task types has additional properties–currently only the "oneclass", "twoclass", "multiclass" properties of classification Tasks–should these additional properties be listed in properties.adding or properties.needed.

predict.type

mlr makes it possible for Learners to make different kinds of prediction. Usually they can predict a “response”, making their best effort to predict the true value of a task target. Many Learner types can predict a probability when their predict.type is set to "prob", returning a data.frame of their estimated probability distribution over possible responses. For regression Learners, predict.type can be "se" for the Learner to predict the estimated standard error of its response prediction.

When TOCPOs invert these predictions, they may support only some of these predict.types themselves, and they may require the underlying Learner to make a different kind of prediction than the one that is requested of the CPO.

This is declared using the predict.type.map parameter of makeCPO*TargetOp(). It is a named list or named character vector with the names indicating the supported predict.types, and the values indicating the required underlying predictions. For example, if a TOCPO can perform "response" and "se" prediction, and to predict "response" the underlying Learner must also perform "response" prediction, but for "se" prediction it must perform "prob" prediction, the predict.type.map would have the value

c(response = "response", se = "prob")

makeCPOTargetOp()

makeCPOTargetOp() has a cpo.train and cpo.retrafo function parameter that work similarly to the ones of makeCPO(). In contrast to makeCPO(), however, cpo.retrafo must return the target data instead of the feature data. The data and target parameters of cpo.retrafo get the same data as they get in a FOCPO created with makeCPO(), with the exception that if dataformat is "task" or "df.all", the target parameter will receive the whole input data in form of a Task or data.frame (while the data argument, as in a FOCPO, will receive only the feature data.frame). The return value of cpo.retrafo for a TOCPO must always be in the same format as the input target value: a data.frame with the manipulated target values when dataformat is anything besides "task" or "df.all", or a Task or data.frame of all data (with non-target columns unmodified) otherwise.

Inversion of predictions is performed using the functions cpo.train.invert and cpo.invert. cpo.train.invert takes a data and a control argument, and any arguments declared in the par.set. It is called whenever new data is fed into the CPO or its retrafo CPOTrained, and creates a CPOTrained state that is used to invert the prediction done on this new data. The control argument takes the value returned by the cpo.train function upon initial training, and the data argument is the new data for which to prepare the CPOTrained inverter. It has the form dictated by dataformat, with the exception that the "task" and "df.all" dataformats are handled as "df.features"; this is necessary since the new data could be a data.frame of data with unknown target.
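As a minimal sketch of how the four functions fit together, the following TOCPO centers a regression target and adds the training mean back to predictions. It is an illustration under assumptions: the names xmpCenterTarget and "xmpcenter" are hypothetical, only "response" prediction is declared in predict.type.map, and the prediction handed to cpo.invert is assumed to be a numeric vector for regression "response" predictions.

```r
library("mlr")
library("mlrCPO")

# hypothetical TOCPO: subtract the training mean from a regression target
xmpCenterTarget = makeCPOTargetOp("xmpcenter",  # nolint
  dataformat = "df.features",
  properties.target = "regr",
  predict.type.map = c(response = "response"),
  cpo.train = function(data, target) {
    mean(target[[1]])  # control object: mean of the training target
  },
  cpo.retrafo = function(data, target, control) {
    target[[1]] = target[[1]] - control
    target  # data.frame of target columns, same format as the input
  },
  cpo.train.invert = function(data, control) {
    control  # the inversion only needs the training mean
  },
  cpo.invert = function(target, control.invert, predict.type) {
    target + control.invert  # 'target' is the prediction to invert
  })

trafd = bh.task %>>% xmpCenterTarget()
mean(getTaskTargets(trafd))  # close to zero

cpo.lm = xmpCenterTarget() %>>% makeLearner("regr.lm")
pred = predict(train(cpo.lm, bh.task), bh.task)
plain = predict(train("regr.lm", bh.task), bh.task)
all.equal(pred$data$response, plain$data$response)
```

A linear model with intercept is unaffected by shifting the target, so the inverted predictions should agree with those of the plain Learner; this makes the example convenient for checking that inversion works.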

The following is an example of a TOCPO that trains a classification Learner on a binary classification Task and changes it to a Task of whether or not the Learner predicted the truth for a given data line correctly. (Real-world applications would probably need to take some precautions against overfitting.) In its cpo.train step, the given Learner is trained on the incoming data and the resulting WrappedModel object is returned as the “control” object. This is given to the cpo.retrafo function, which performs prediction and creates a new classification Task with the match / mismatch between model prediction and ground truth as target. When an external Learner is trained on data that was preprocessed like this, its prediction will be whether the CPO-internal Learner can be trusted to predict a given data row. To “invert” this, i.e. to get the actual prediction, the cpo.invert function needs to have the internal Learner’s prediction as well as the prediction made by the external Learner. The former is provided by cpo.train.invert, which uses the WrappedModel to make a prediction on the new data, and given as control.invert to cpo.invert. The latter is the target data given to cpo.invert. This example CPO supports inverting both "response" and "prob" predict.type predictions, as declared in the predict.type.map argument. The actual predict.type to invert is given to cpo.invert as an argument.

xmpMetaLearn = makeCPOTargetOp("xmp.meta",  # nolint
  pSS(lrn: untyped),
  dataformat = "task",
  properties.target = c("classif", "twoclass"),
  predict.type.map = c(response = "response", prob = "prob"),
  cpo.train = function(data, target, lrn) {
    cat("*** cpo.train ***\n")
    lrn = setPredictType(lrn, "prob")
    train(lrn, data)
  },
  cpo.retrafo = function(data, target, control, lrn) {
    cat("*** cpo.retrafo ***\n")
    prediction = predict(control, target)
    tname = getTaskTargetNames(target)
    tdata = getTaskData(target)
    tdata[[tname]] = factor(prediction$data$response == prediction$data$truth)
    makeClassifTask(getTaskId(target), tdata, tname, positive = "TRUE",
      fixup.data = "no", check.data = FALSE)
  },
  cpo.train.invert = function(data, control, lrn) {
    cat("*** cpo.train.invert ***\n")
    predict(control, newdata = data)$data
  },
  cpo.invert = function(target, control.invert, predict.type, lrn) {
    cat("*** cpo.invert ***\n")
    if (predict.type == "prob") {
      outmat = as.matrix(control.invert[grep("^prob\\.", names(control.invert))])
      revmat = outmat[, c(2, 1)]
      outmat * target[, "prob.TRUE", drop = TRUE] +
        revmat * target[, "prob.FALSE", drop = TRUE]
    } else {
      stopifnot(levels(target) == c("FALSE", "TRUE"))
      numeric.prediction = as.numeric(control.invert$response)
      numeric.res = ifelse(target == "TRUE",
        numeric.prediction,
        3 - numeric.prediction)
      factor(levels(control.invert$response)[numeric.res],
        levels(control.invert$response))
    }
  })

cpo = xmpMetaLearn(makeLearner("classif.logreg"))

To show the inner workings of this CPO, the following example data is used.

set.seed(12)
split = makeResampleInstance(hout, pid.task)
train.task = subsetTask(pid.task, split$train.inds[[1]])
test.task = subsetTask(pid.task, split$predict.inds[[1]])

It can be instructive to watch the cat() output of this CPO to see which function gets called at what point in the lifecycle. The cpo.train function is called first to create the control object. The Task is transformed in cpo.retrafo. Also cpo.train.invert is called, since an inverter attribute is attached to the returned trafo.

trafd = train.task %>>% cpo
attributes(trafd)

The values of the target column (“diabetes”) of the result can be compared with the prediction of a "classif.logreg" Learner on the same data:

head(getTaskData(trafd))
model = train(makeLearner("classif.logreg", predict.type = "prob"), train.task)
head(predict(model, train.task)$data[c("truth", "response")])

When new data is transformed using the retrafo CPOTrained, another inverter attribute is created, and hence cpo.train.invert is called again. Since the target column of the test.task in the following example is also transformed, the cpo.retrafo function is called.

retr = test.task %>>% retrafo(trafd)
attributes(retr)

In a real world application, it would be possible for the new incoming data to have unknown target values. In that case, no target column would need to be changed, and cpo.retrafo is not called. The resulting data, retr.df, equals the input data with an inverter attribute added.

retr.df = getTaskData(test.task, target.extra = TRUE)$data %>>% retrafo(trafd)
names(attributes(retr.df))

The invert functionality can be demonstrated by making a prediction with an external model.

ext.model = train("classif.svm", trafd)
ext.pred = predict(ext.model, retr)
newpred = invert(inverter(retr), ext.pred)
performance(newpred)

It may also be instructive to attach the xmpMetaLearn CPO to a Learner to see which functions get called during training and prediction of a TOCPO-Learner. Since the Learner does not invert the training data, a CPOTrained for inversion is not created during training, and cpo.train.invert is hence not called. Only cpo.train (for control object creation) and cpo.retrafo (target value change) are called. During prediction, the input data is used to create an (internally used) inversion CPOTrained, which is promptly used to invert the prediction made by "classif.svm". Hence both cpo.train.invert and cpo.invert are called in succession.

cpo.learner = cpo %>>% makeLearner("classif.svm")
cpo.model = train(cpo.learner, train.task)
lrnpred = predict(cpo.model, test.task)
performance(lrnpred)

See Postscriptum for an evaluation of xmpMetaLearn’s performance.

Functional TOCPO

Just like for FOCPOs, it is possible to create functional TOCPOs. In the case of makeCPOTargetOp(), it is possible to have cpo.train create cpo.retrafo and cpo.train.invert, instead of giving them to makeCPOTargetOp() directly. Just as in makeCPO(), these functions can then access the state of their environment in the cpo.train call and hence have neither a control argument, nor any arguments for the par.set parameters. Since cpo.train must in this case create two functions, these functions only need to be defined as variables within cpo.train; the return value of cpo.train is ignored.

Note that cpo.retrafo and cpo.train.invert must either be both functional or both object based.

It is furthermore possible to have cpo.train.invert return a cpo.invert function, instead of giving it to makeCPOTargetOp(). As above, the returned function should not have any parameters for the ones given in par.set, and should not have a control.invert parameter. cpo.invert can be functional or not, independently of whether cpo.retrafo and cpo.train.invert are functional.

As in makeCPO(), all functions that are given functionally must be explicitly set to NULL in the makeCPOTargetOp() call.

The xmpMetaLearn example above with functional cpo.retrafo, cpo.train.invert and cpo.invert would look like the following:


xmpMetaLearn = makeCPOTargetOp("xmp.meta.fnc",  # nolint
  pSS(lrn: untyped),
  dataformat = "task",
  properties.target = c("classif", "twoclass"),
  predict.type.map = c(response = "response", prob = "prob"),
  # set the cpo.* parameters not needed to NULL:
  cpo.retrafo = NULL, cpo.train.invert = NULL, cpo.invert = NULL,
  cpo.train = function(data, target, lrn) {
    cat("*** cpo.train ***\n")
    lrn = setPredictType(lrn, "prob")
    model = train(lrn, data)
    cpo.retrafo = function(data, target) {
      cat("*** cpo.retrafo ***\n")
      prediction = predict(model, target)
      tname = getTaskTargetNames(target)
      tdata = getTaskData(target)
      tdata[[tname]] = factor(prediction$data$response == prediction$data$truth)
      makeClassifTask(getTaskId(target), tdata, tname, positive = "TRUE",
        fixup.data = "no", check.data = FALSE)
    }
    cpo.train.invert = function(data) {
      cat("*** cpo.train.invert ***\n")
      prediction = predict(model, newdata = data)$data
      function(target, predict.type) {  # this is returned as cpo.invert
        cat("*** cpo.invert ***\n")
        if (predict.type == "prob") {
          outmat = as.matrix(prediction[grep("^prob\\.", names(prediction))])
          revmat = outmat[, c(2, 1)]
          outmat * target[, "prob.TRUE", drop = TRUE] +
            revmat * target[, "prob.FALSE", drop = TRUE]
        } else {
          stopifnot(levels(target) == c("FALSE", "TRUE"))
          numeric.prediction = as.numeric(prediction$response)
          numeric.res = ifelse(target == "TRUE",
            numeric.prediction,
            3 - numeric.prediction)
          factor(levels(prediction$response)[numeric.res],
            levels(prediction$response))
        }
      }
    }
  })

Constant Invert TOCPOs

The example given above is a relatively elaborate TOCPO which needs information from the prediction data to perform inversion. Many simpler applications of target transformation do not need this information, because their inversion step is independent of this data. It is possible to declare such a TOCPO using the constant.invert flag in makeCPOTargetOp(). If constant.invert is set to TRUE, the cpo.train.invert argument must be explicitly set to NULL. cpo.invert still needs to have a control.invert argument; it is set to the value returned by cpo.train.

The following example is a TOCPO for regression Tasks that centers target values during training. After prediction, the data is inverted by adding the original mean of the training data to the predictions. This inversion operation does not need any information about the prediction data going in, so the TOCPO can be declared constant.invert.

The cpo.retrafo function is also called when new prediction data with a target column is transformed (as during model validation). In that case, the mean of the training data column is subtracted. Therefore the mean generated by cpo.train needs to be used in cpo.retrafo (i.e. the control value), not the mean of the target data present.

xmpRegCenter = makeCPOTargetOp("xmp.center",  # nolint
  constant.invert = TRUE,
  cpo.train.invert = NULL,  # necessary for constant.invert = TRUE
  dataformat = "df.features",
  properties.target = "regr",
  cpo.train = function(data, target) {
    # control value is just the mean of the target column
    mean(target[[1]])
  },
  cpo.retrafo = function(data, target, control) {
    # subtract mean from target column in retrafo
    target[[1]] = target[[1]] - control
    target
  },
  cpo.invert = function(target, predict.type, control.invert) {
    target + control.invert
  })

cpo = xmpRegCenter()

To illustrate this CPO, the following data is used:

train.task = subsetTask(bh.task, 150:155)
getTaskTargets(train.task)
predict.task = subsetTask(bh.task, 156:160)
getTaskTargets(predict.task)

The target column of the task after transformation has a mean of 0.

trafd = train.task %>>% cpo
getTaskTargets(trafd)

When applying the retrafo CPOTrained to a new task, the mean of the training task target column is subtracted.

getTaskTargets(predict.task)
retr = retrafo(trafd)
predict.traf = predict.task %>>% retr
getTaskTargets(predict.traf)
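
The arithmetic behind this can be sketched in plain R, without mlrCPO. The target values and variable names below are made up for illustration and are not taken from bh.task.

```r
train.target = c(15, 17, 19, 21)  # hypothetical training targets
new.target = c(30, 32)            # hypothetical targets of new data

control = mean(train.target)      # what cpo.train would return: 18

# retrafo: subtract the training mean, not the mean of the new data
new.centered = new.target - control  # 12, 14

# inversion: add the training mean back
new.inverted = new.centered + control
stopifnot(isTRUE(all.equal(new.inverted, new.target)))

# centering with the new data's own mean would instead put the values
# on a different scale from the one the model was trained on
new.target - mean(new.target)  # -1, 1
```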

When inverting a regression prediction, the mean of the training data target column is added to the prediction.

model = train("regr.lm", trafd)
pred = predict(model, predict.traf)
pred
invert(inverter(predict.traf), pred)

Since "regr.lm" is translation invariant and deterministic, the prediction equals the prediction made without centering the target:

model = train("regr.lm", train.task)
predict(model, predict.task)
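
The translation invariance can also be checked directly with stats::lm(), which "regr.lm" wraps. This sketch uses the cars data set that ships with R rather than bh.task; the variable names are chosen only for illustration.

```r
# fit on the original target
fit.raw = lm(dist ~ speed, data = cars)

# fit on the centered target
offset = mean(cars$dist)
cars.centered = transform(cars, dist = dist - offset)
fit.centered = lm(dist ~ speed, data = cars.centered)

newdata = data.frame(speed = c(10, 20))
pred.raw = predict(fit.raw, newdata)
# "invert" the centered prediction by adding the training mean back
pred.inverted = predict(fit.centered, newdata) + offset

all.equal(pred.raw, pred.inverted)
```

Centering the target only shifts the intercept of the linear model, so both predictions agree up to numerical tolerance.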

A special property of constant.invert TOCPOs is that their retrafo CPOTrained can also be used for inversion. This is the case since the tight coupling of inversion operation to the data used to create the prediction is not necessary when the inversion is actually independent of this data. This is indicated by getCPOTrainedCapability() returning a vector with the "invert" capability set to 1. However, when using the retrafo CPOTrained for inversion, the “truth” column is absent from the inverted prediction.

getCPOTrainedCapability(retr)
invert(retr, pred)

Functional Constant Invert TOCPO

Just as above, constant.invert TOCPOs can be functional. For this, the cpo.train function must declare both a cpo.retrafo and a cpo.invert variable which perform the requested operations. These functions have no control or control.invert parameter, and no parameters pertaining to par.set.
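
As a minimal sketch, a functional variant of xmpRegCenter could look like the following. It assumes that the rules described above apply as stated: cpo.retrafo, cpo.train.invert and cpo.invert are all set to NULL in the makeCPOTargetOp() call, and cpo.train defines the cpo.retrafo and cpo.invert variables. The names xmpRegCenterFnc, "xmp.center.fnc" and target.mean are made up for this illustration.

```r
library("mlr")
library("mlrCPO")

xmpRegCenterFnc = makeCPOTargetOp("xmp.center.fnc",  # nolint
  constant.invert = TRUE,
  dataformat = "df.features",
  properties.target = "regr",
  # all functions that are given functionally are set to NULL:
  cpo.retrafo = NULL, cpo.train.invert = NULL, cpo.invert = NULL,
  cpo.train = function(data, target) {
    # the state lives in the environment of this call
    target.mean = mean(target[[1]])
    cpo.retrafo = function(data, target) {
      target[[1]] = target[[1]] - target.mean
      target
    }
    cpo.invert = function(target, predict.type) {
      target + target.mean
    }
  })

cpo.fnc = xmpRegCenterFnc()
trafd.fnc = subsetTask(bh.task, 150:155) %>>% cpo.fnc
getTaskTargets(trafd.fnc)  # centered target values
```

Compared to the object based xmpRegCenter above, the training mean is not passed around as control or control.invert, but is captured by the two closures.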

Stateless TOCPO

Very simple target column operations that operate on a row-by-row basis, without needing information e.g. from training data, can be declared as “stateless”. Similarly to makeCPO(), when the cpo.train parameter is set to NULL, no control object is created for a CPOTrained. Furthermore, a stateless TOCPO must always have constant.invert set as well. Therefore, only cpo.retrafo and cpo.invert are given as functions, both without a control or control.invert argument. One example is a TOCPO that log-transforms the target column of a regression task, and exponentiates the predictions made from this during inversion. (A better inversion would take the "se" prediction into account, see cpoLogTrafoRegr.)

xmpLogRegr = makeCPOTargetOp("log.regr",  # nolint
  constant.invert = TRUE,
  properties.target = "regr",
  cpo.train = NULL, cpo.train.invert = NULL,
  cpo.retrafo = function(data, target) {
    target[[1]] = log(target[[1]])
    target
  },
  cpo.invert = function(target, predict.type) {
    exp(target)
  })

cpo = xmpLogRegr()

The CPO takes the logarithm of the task target column both during training and when using the retrafo CPOTrained.

trafd = train.task %>>% cpo
getTaskTargets(trafd)
retr = retrafo(trafd)
predict.traf = predict.task %>>% retr
getTaskTargets(predict.traf)
model = train("regr.lm", trafd)
pred = predict(model, predict.traf)
pred

Note that both the inverter and the retrafo CPOTrained can be used for inversion, since a stateless TOCPO also has constant.invert set. As above, when using the retrafo CPOTrained, the truth column is absent from the result.

invert(inverter(predict.traf), pred)
invert(retr, pred)

makeCPOExtendedTargetOp()

Just as for FOCPOs, it is possible to declare a TOCPO while having more direct control over what happens at which stage of training, re-transformation, or inversion. In a TOCPO defined with makeCPOTargetOp(), the cpo.retrafo and cpo.train.invert functions are called automatically when necessary during training and re-transformation. makeCPOExtendedTargetOp() instead has a cpo.trafo and a cpo.retrafo parameter, which get called during the respective operation.

cpo.trafo must be a function taking the same parameters as cpo.train in makeCPOTargetOp(). Instead of returning a control object, it must define a variable named “control”, and a variable named “control.invert”. The former is used as the control argument of cpo.retrafo, the latter is used as control.invert for cpo.invert when using the inverter CPOTrained created during training. The return value of cpo.trafo must be similar to the value returned by cpo.retrafo in makeCPOTargetOp(): it must be the modified data set or target, depending on dataformat.

cpo.retrafo must take the same parameters as in makeCPOTargetOp(). It must declare a control.invert variable that will be given to cpo.invert when using the inverter CPOTrained created during retransformation. Since cpo.retrafo is always called during retrafo CPOTrained application, a “target” column may or may not be present. If a target column is not present, the target parameter of cpo.retrafo is NULL and the return value of cpo.retrafo is ignored; otherwise it must be the transformed target value (which, as in makeCPOTargetOp(), can be a Task or data.frame of all data if dataformat is "task" or "df.all").

cpo.invert works just as in makeCPOTargetOp().

The following is a nonsensical, synthetic example that adds 1 to the target column of a regression Task during initial training, subtracts 1 during retrafo re-application and is a no-op during inversion.

xmpSynCPO = makeCPOExtendedTargetOp("syn.cpo",  # nolint
  properties.target = "regr",
  cpo.trafo = function(data, target) {
    cat("*** cpo.trafo ***\n")
    target[[1]] = target[[1]] + 1
    control = "control created in cpo.trafo"
    control.invert = "control.invert created in cpo.trafo"
    target
  },
  cpo.retrafo = function(data, target, control) {
    cat("*** cpo.retrafo ***", "control is:", deparse(control), sep = "\n")
    control.invert = "control.invert created in cpo.retrafo"
    if (!is.null(target)) {
      cat("target is non-NULL, performing transformation\n")
      target[[1]] = target[[1]] - 1
      return(target)
    } else {
      cat("target is NULL, no transformation (but control.invert was created)\n")
      return(NULL)  # is ignored.
    }
  },
  cpo.invert = function(target, control.invert, predict.type) {
    cat("*** invert ***", "control.invert is:", deparse(control.invert),
      sep = "\n")
    target
  })

cpo = xmpSynCPO()

For an “extended” TOCPO, only one of the transformation functions is called in each invocation. Initial transformation calls cpo.trafo and adds 1 to the targets; using the CPOTrained for re-transformation calls cpo.retrafo and subtracts 1.

trafd = train.task %>>% cpo
getTaskTargets(trafd)
retrafd = train.task %>>% retrafo(trafd)
getTaskTargets(retrafd)

It is also possible to perform re-transformation with a data.frame that does not include the target column. In that case the target value given to cpo.retrafo will be NULL, as reported by that function in this example:

retrafd = getTaskData(train.task, target.extra = TRUE)$data %>>% retrafo(trafd)

The trafd object has an inverter CPOTrained attribute that was created by cpo.trafo, whereas the retrafd object has an inverter CPOTrained attribute that was (necessarily) created by cpo.retrafo. This is made visible by the cat() output of the example cpo.invert function:

inv = invert(inverter(trafd), 1:6)
inv = invert(inverter(retrafd), 1:6)

Postscriptum

As an aside, the Learner enhanced by xmpMetaLearn seems to perform marginally better than either "classif.svm" or "classif.logreg" on their own for a large enough subset of pid.task (here resampled with output suppressed).

learners = list(
    logreg = makeLearner("classif.logreg"),
    svm = makeLearner("classif.svm"),
    cpo = xmpMetaLearn(makeLearner("classif.logreg")) %>>%
      makeLearner("classif.svm")
)

# suppress output of '*** cpo.train ***' etc.
configureMlr(show.info = FALSE, show.learner.output = FALSE)

perfs = sapply(learners, function(lrn) {
  unname(replicate(20, resample(lrn, pid.task, cv10)$aggr))
})

# reset mlr settings
configureMlr()

boxplot(perfs)

p-values for comparing the CPOLearner with both "classif.logreg" and "classif.svm":

pvals = c(
    logreg = t.test(perfs[, "logreg"], perfs[, "cpo"],
      alternative = "greater")$p.value,
    svm = t.test(perfs[, "svm"], perfs[, "cpo"],
      alternative = "greater")$p.value
)

round(p.adjust(pvals), 3)