CPO: Composable Preprocessing Operators #1827

mb706 · 2017-06-05T20:57:45Z

This is my GSoC project. See the preliminary vignette for a quick overview (more compact version with the R output removed).

Description for a General Audience

@everyone. If you have questions, ideas or feedback, please don't hesitate to write to me, here or elsewhere!

What is this?

Functions for data manipulation and pre-processing ➕ a replacement for makePreprocWrapper ➕ lots of syntactic sugar.

Description

CPO constructors are called like functions and create an object whose hyperparameters can be read and changed using getHyperPars, setHyperPars, etc.

> cpoPca()
pca(center = TRUE, scale = TRUE)
> cpoPca(scale = FALSE)
pca(center = TRUE, scale = FALSE)
> cpo = cpoPca(scale = FALSE)
> getHyperPars(cpo)
$center
[1] TRUE

$scale
[1] FALSE

> setHyperPars(cpo, center = FALSE)
pca(center = FALSE, scale = FALSE)

These objects can be applied to Tasks or data.frames to manipulate data, or can be attached to a Learner to create a wrapped learner similar to makePreprocWrapper.

> # PCA-rotate pid.task
> rotated.pid.task = pid.task %>>% cpoPca()
>
> # rotate a data.frame
> rotated.attitude = attitude %>>% cpoPca()
> # use the same rotation matrix on a
> # shorter version of the DF
> short.rotated.attitude = head(attitude) %>>% retrafo(rotated.attitude)
> all.equal(head(rotated.attitude), short.rotated.attitude,
+   check.attributes = FALSE)
[1] TRUE
> 
> # Centering / Scaling *after* PCA
> neoPCA = cpoPca(center = FALSE, scale = FALSE, id = "pca") %>>% cpoScale()
> neoPCA
(pca.pca >> scale)(pca.center = FALSE, pca.scale = FALSE, center = TRUE, scale = TRUE)
>
> # Attach the above to learner
> pcaLogreg = neoPCA %>>% makeLearner("classif.logreg")
> getHyperPars(pcaLogreg)
$model
[1] FALSE

$pca.center
[1] FALSE

$pca.scale
[1] FALSE

$center
[1] TRUE

$scale
[1] TRUE

Custom CPO constructors can be created using makeCPOObject or makeCPOFunctional. Note that the (re)transformation operations can be written as braced expressions; the function header is added automatically.

> # bogus example, multiply first column
> cpomultiplier = makeCPOFunctional("multiplierF", factor = 1: numeric[~., ~.],
+   cpo.trafo = {  # implicit 'function(data, target, factor = 1)' here
+     data[[1]] = data[[1]] * factor
+     attr(data, "retrafo") = function(data) {
+       data[[1]] = data[[1]] / factor
+       data
+     }
+     data
+   })
> head(getTaskData(pid.task %>>% cpomultiplier(10000)))  # note first column
  pregnant glucose pressure triceps insulin mass pedigree age diabetes
1    60000     148       72      35       0 33.6    0.627  50      pos
2    10000      85       66      29       0 26.6    0.351  31      neg
3    80000     183       64       0       0 23.3    0.672  32      pos
4    10000      89       66      23      94 28.1    0.167  21      neg
5        0     137       40      35     168 43.1    2.288  33      pos
6    50000     116       74       0       0 25.6    0.201  30      neg

Implementation details

@berndbischl, @mllg

"CPO" (the name)

Would you want me to use "TaskTransform" instead of CPO (or something entirely different)?

makeParamSet Syntactic Sugar

I wrote a function, paramSetSugar, that makes creating ParamSets much less painful. Example:

> paramSetSugar(a: logical, b: integer[0, 10], c: numeric[, ]^2)
           Type len Def      Constr Req Tunable Trafo
a       logical   -   -           -   -    TRUE     -
b       integer   -   -     0 to 10   -    TRUE     -
c numericvector   2   - -Inf to Inf   -    TRUE     -

Do you like this idea in general (maybe you want to incorporate it into ParamHelpers?), or would you rather I not use it in my project?
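For comparison, the same parameter set written with plain ParamHelpers constructors might look like the sketch below. It assumes the ParamHelpers package is installed; the variable name ps is only for illustration, and the reading of numeric[, ]^2 as an unbounded numeric vector of length 2 is taken from the printed table above.

```r
library(ParamHelpers)

# Hand-written counterpart of
# paramSetSugar(a: logical, b: integer[0, 10], c: numeric[, ]^2)
ps = makeParamSet(
  makeLogicalParam("a"),
  makeIntegerParam("b", lower = 0, upper = 10),
  makeNumericVectorParam("c", len = 2)  # bounds default to -Inf / Inf
)
print(ps)
```

The sugar mainly saves the repeated make*Param calls and the quoting of parameter names.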

Object based vs. Functional CPOs

I implemented both, one in R/CPOObjectBased.R, the other in R/CPOFunctional.R; the code shared by both is mostly in R/CPOAuxiliary.R. Each has advantages and disadvantages. The object-based implementation could, in theory, use less memory, since its model does not carry around an environment that usually contains the training data. It is also easier to debug if you like to use debugonce. The functional implementation, in turn, can be applied directly to Task objects (since the CPO objects are just functions in this case) and could probably be coerced quite easily into collaborating with the magrittr package.
Have a look at the two concrete implementations to see which one you like more.
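The memory argument can be illustrated in base R. This is only a sketch of the general closure-versus-object trade-off, not the code in this PR; all function names below are made up for the illustration.

```r
# Functional style: the returned retrafo is a closure, so everything in
# the enclosing call frame, including the training data, stays reachable.
make_retrafo_functional = function(data) {
  center = colMeans(data)
  function(newdata) sweep(newdata, 2, center)
}

# Object-based style: only an explicit 'control' list is stored.
make_retrafo_object = function(data) {
  list(control = list(center = colMeans(data)))
}
apply_retrafo_object = function(obj, newdata) {
  sweep(newdata, 2, obj$control$center)
}

train = as.matrix(attitude)
f = make_retrafo_functional(train)
o = make_retrafo_object(train)

# The closure's environment still holds the training data ...
print(ls(environment(f)))
# ... while the object carries only the column means.
print(names(o$control))

# Both styles transform new data identically.
stopifnot(isTRUE(all.equal(f(head(train)),
  apply_retrafo_object(o, head(train)))))
```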

Note About Object Based Implementation

The makePreprocWrapper implementation in mlr relies on the transformation function returning an object list(data = [data], control = [control]). I had the idea of just having it return the resulting data, and using R magic to inspect the function's environment to get at the control. See e.g. the implementation of cpoScale:

cpoScale = makeCPOObject("scale", center = TRUE: logical, scale = TRUE: logical,
  cpo.trafo = {
    ## boilerplate :-( Hope to get rid of this at some point
    targetdata = data[target]
    data[target] = NULL
    ## here we go
    result = scale(as.matrix(data), center = center, scale = scale)
    data[] = result
    data[target] = targetdata
    ## the 'control' object will be retrieved by the CPO machinery
    ## and given to cpo.retrafo
    control = list(center = attr(result, "scaled:center"),
      scale = attr(result, "scaled:scale"))
    data
  }, cpo.retrafo = {
    ## here we have the 'control' object
    as.data.frame(scale(as.matrix(data),
      center = control$center, scale = control$scale))
  })

What is your opinion about this? There are alternatives. I could copy the entire cpo.trafo environment to cpo.retrafo, so the user wouldn't need to worry about which variables are available and which are not. The downside is that this would save the entire training data inside the model, which might be memory-intensive. I could also stop being fancy and just return list(data, control) as in makePreprocWrapper. Finally, there is a way to inspect cpo.retrafo and copy only the objects it uses, but this inspection is bound to be incomplete (the problem is equivalent to the halting problem) and could copy more data than the retrafo part needs.
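The "inspect the environment" idea can be sketched in base R as follows. This is a simplified illustration under my own assumptions, not the actual CPO machinery: run_trafo and scale.trafo are hypothetical names, and the real implementation builds a function from the braced expression rather than calling eval directly.

```r
# Hypothetical helper: evaluate a braced trafo expression in a fresh
# environment and pick up the 'control' variable it created there.
run_trafo = function(trafo.expr, data) {
  env = new.env(parent = globalenv())
  env$data = data
  result = eval(trafo.expr, envir = env)
  # Only 'control' is kept; 'data' and other temporaries go away with env.
  list(data = result,
    control = get("control", envir = env, inherits = FALSE))
}

scale.trafo = quote({
  result = scale(as.matrix(data))
  control = list(center = attr(result, "scaled:center"),
    scale = attr(result, "scaled:scale"))
  as.data.frame(result)
})

out = run_trafo(scale.trafo, attitude)
print(names(out$control))
```

Because only the named variable is extracted, the training data is not stored with the model, which is the memory advantage over copying the whole environment.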

Composition operator

I chose %>>% since it is similar to, but not used by, magrittr. It works with CPOs in conjunction with Learners on the right and Tasks on the left, but Task %>>% CPO %>>% Learner is not supported because of the associativity problem.
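My reading of the associativity problem: R groups user-defined %op% operators from left to right, so the chain is parsed as (Task %>>% CPO) %>>% Learner, and the CPO has already been applied to the Task by the time the Learner shows up. The grouping can be checked on the unevaluated expression; task, cpo and learner are placeholder symbols here, not real objects.

```r
# Parse a chain without evaluating it.
expr = quote(task %>>% cpo %>>% learner)

# The outer call's left operand is the already-grouped (task %>>% cpo),
# its right operand is the bare symbol 'learner'.
print(expr[[2]])
print(expr[[3]])
```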

State of implementation

The current roadmap, as I see it; comments?

Maybe some day...

Task weights
Task blocking


lintr-bot · 2017-09-04T14:58:51Z

R/measures.R:1435:73: style: Use FALSE instead of the symbol F.

    perror = pec(probs, f, data = newdata[, tn], times = grid, exact = F, exactness = 99L,
                                                                       ~^

larskotthoff · 2017-12-13T17:46:57Z

What's the status here? Does this still need to be merged for mlrCPO to work?

mb706 · 2017-12-13T17:48:49Z

Yes please, mlrCPO relies on a few kinda-internal functions of mlr. Should be pretty stable as well (my last few commits were just running after merge conflicts).

larskotthoff · 2017-12-13T18:17:40Z

Thanks, merging.

* Introducing Composable Preprocessing Objects. * ParamSet syntactic sugar * Make git ignore emacs temp files * Bugfixes in ParamSetSugar * Automatically generate function from braced expressions * lintr fixes * Creation of CPOObject * Nice printing * CPOObject concatenation * CPO composition * Composition operator also for attachment * wrapping seems to work now * some experiments * Some reorganizing * Implemented CPOFunctional, against all odds. Probably full of bugs still. * Reorg: Organize CPOObjectBased, CPOFunctional the same * Bugfixes * CPOObject now doesn't need to return 'control', just create it. * lintr * ParamSetSugar test * Indentation * Testing most of CPO, excluding ParamSet feasibility checks * Bugfixes, found through tests * setHyperPars: assert uniquely named parameters * Testing hyperparameter feasibility * Test parameter feasibility * Testing actual data transformation * Testing CPO trafo functions * Testing requirement handling * Forgive absence of parameters with unfulfilled requirements * Requirement handling when changing ID * Repair global var problems in S3 methods in cpo tests * lintr * Application operator * Corrected copy-paste caused typo * Inform user when he forgets to construct CPO * Documentation * Make R CMD check --no-test happy * lintr doesnt recognize CPO function definitions as functions * Retrafo set / access functions * paramSetSugar parameter pss.* now have dot prefix for R param matching reasons * retrafo() machinery * Functional CPO now uses retrafo() * Roxygenise * Tests work again * Bugfixes * More informative error messages * More informative error messages * Testing for error handling * Embarrassing! 
* static analyzer safe paramSetSugar * Using NA instead of dot to indicate missing parameter * Cleaning up documentation * Documentation fixes * lintr * Turn chain of preprocs into list, and assemble list into chain * roxygenize * Put common CPO test objects into helper_cpo.R * Refactor chainung and un-chaining * Chaining, unchaining of object based retrafos * use 'predict' to apply retrafos * lint * Adding get / set retrafo state functionality * Adding get state and makefromstate for object based * Cleaning up CPO object based * Testing for retrafo state * Cleaning up CPOFunctional * Adding get state and makefromstate for functional based * lintr * R CMD check * Small test correction Evidently I should clear my .GlobalEnv before running tests. * small comment change * retrafo assignment now checks for type, not function * Adding properties parameters * Added docu, todo * Starting task shape verification things * Get format check its own file * Auxiliary files reorg * changed object based callCPO[Re]Trafo, need to propagate the changes now * One more step towards properties & data shape checking * Tests pass again * cleaning up * Make lint approve of TODOs temporarily * Tests first half of target type functionality * checkLearnerBeforeTrain: Wrong error message when unordered not supported * lintr * NOOP * Finished datasplit tests * Most property tests are done * get CPO from learner * Ported properties and datasplit to functional * Tests pass * roxygenise * Make tests faster * travis timeout ++ * Travis timeout +++ * Rewritten CPO core. A beauty to behold! This removes 'makeCPOFunctional' and 'makeCPOObject' and replaces them both with 'makeCPO'. 
* Travis timeout ++++ * Added 'factor', 'ordered', 'onlyfactor', 'numeric' datasplit Numeric splits also support matrix instead of data frame * Introducing NULLCPO, the neutral element of the CPO monad * Starting targetbound CPO * Targetbound CPO Task conversion backend * static code analyser found bugs * Making big steps towards target CPOs * to-do list, travis timeout ++ again * Pretty much done with target-bound CPO * Introducing stateless CPOs * is.nullcpo * Completing stateless * Roxygenise * Tests pass * lintr * stateless trafo-less CPO * ShapeInfo printing * Nicer ShapeInfo printing * More generics for getting CPO information * Multiplexer, Applicator * Renaming test files * Split up test_cpo_datasplit into *_datasplit and *_properties * Checking par.vals availability at the right places * Proper datasplit numeric / factor / etc handling * New tests * Datasplit numeric, factor, ordered, onlyfactor finally seem to work * Finished datasplit tests * summary bug * example CPOs handle DFs containing non-numeric columns * repair summary * Accept NA vector length * Accept character vectors for discrete character params * cpoSelect CPO * cpoSelect Params Reorg * cpoCbind * Check more rigorously that CPOs don't get called too often. 
* Test cpoCbind with tasks * listCPO * Testing concrete CPOs so far * Fix retrafo column name test * fix.factors * dummy encoder * Column selection by name * invert option for cpoSelect * Starting to implement affect.* * Interpreting subset * Fixing some bugs, implementing some tests, for affect subset * Don't print meta-params for CPO constructors * Tests for affect.* done * Collect meta-CPOs in a separate file * Finishing cpoMeta and its tests * lintr * Export cpoMeta * Fix summary bug * Export some functions I forgot to export * Adding jupyter vignette * Adding html rendered version of vignette * update .gitignore * Compact html vignette * Fixing test bug for impute * Impute CPO + tests * Adding specialised CPO imputers * CPO Imputers get their own file * lintr * Test that dummys are not created when the flag says so. * Updating Vignette * checkMeasures: instead of missing(), use NULL * Feature Filters * Introducing applyCPO: apply a CPO to a Task / df * Introduce composeCPO: composing two CPOs * Introducing attachCPO: Attaching a CPO to a learner. * Adjust properties of imputers that can only handle certain types * Forgot export * filter features now only operate on the columns of the right type * Constant Feature Remover CPO * CPO for fixing factors * cpoDummyEncode works much better now * MissingIndicators CPO * Repairing checkMeasures * Better travis check * cpoCbind bugfixes * bugfix * Vignette updates * use cases ipynb * Recursive application of CPO fix * roxygenise * Avoid warnings when load_all-ing mlr * Now possible to specify packages associated with CPOs * Bugfix * Fix blackboost bug TODO: report this * Revert "Avoid warnings when load_all-ing mlr" This reverts commit ba2cf73. Reverting this because I made an extra pull-request * Revert "Fix blackboost bug" This reverts commit 9e1007f. 
Reverting this b/c I made an extra PR * Fix imputation of empty df bug * cpoScale fix for 1-column data * cpoApplyFun * cpoRangeScale * cpoProbEncode * Impact encoding * Adding new CPOs * rename cpoRangeScale -> cpoScaleRange * More natural handling of 'stateless' cpo * .retrafo.format added, with new 'combined' option * optionally only export subset of parameters * fix export * removing 'CPOS3Primitive' class, as ordered * removing 'CPOS3Constructor' class, as ordered * removing 'CPOS3RetrafoPrimitive', as ordered * simplifying class structure, as ordered * simplifying class structure further * done simplifying class structure * makeCPO documentation * Handle ID correctly for non-exported values * Pretty printing; Bugfixes * A few new CPOs * cpoSpatialSign * tuning CPO test * tuning CPO test II * bugfixes * roxygenise * Vignette: Examples, CPOs, Construction * cleaning up vignette * vignette html export * Bugfix * Tuning vignette * vignette * exporting necessary things * deleting superfluous files which are now in mlrCPO * forgot necessary function * Export makeBaseWrapper * Removing paramSetSugar * Roxygenise * Exporting changeData * Removing the last few bits from before the cpo-mlr-split * Export checkPredictLearnerOutput for mlrCPO * add 'keywords internal' to internal use only functions * roxygenise * overlooked while merging * %%

mb706 added 30 commits June 3, 2017 14:55

Introducing Composable Preprocessing Objects.

ac03d22

ParamSet syntactic sugar

aa2d283

Make git ignore emacs temp files

d569590

Bugfixes in ParamSetSugar

a005016

Automatically generate function from braced expressions

7443300

lintr fixes

ac8e620

Creation of CPOObject

1a704ee

Nice printing

7349cfb

CPOObject concatenation

fc07a79

CPO composition

a5beba8

Composition operator also for attachment

307a54d

wrapping seems to work now

c91090f

some experiments

3fdbba9

Some reorganizing

6d27fe3

Implemented CPOFunctional, against all odds. Probably full of bugs still.

ce63c0e

Reorg: Organize CPOObjectBased, CPOFunctional the same

c767f7a

Bugfixes

fe26b4f

CPOObject now doesn't need to return 'control', just create it.

9132f86

lintr

222b08a

ParamSetSugar test

e87c580

Indentation

5963c0f

Testing most of CPO, excluding ParamSet feasibility checks

52d6988

Bugfixes, found through tests

d41e287

setHyperPars: assert uniquely named parameters

dc8c342

Testing hyperparameter feasibility

9ba3d4b

Test parameter feasibility

3ab44f1

Testing actual data transformation

0375e41

Testing CPO trafo functions

d461e77

Testing requirement handling

f35f3ee

Forgive absence of parameters with unfulfilled requirements

21303b7

mb706 added 12 commits August 4, 2017 13:26

Bugfix

6d459ee

Merge branch 'master' into ComposablePreprocOperators

9d81764

Tuning vignette

e7310dd

vignette

7398595

exporting necessary things

5a1a725

deleting superfluous files which are now in mlrCPO

31ae1b0

forgot necessary function

12d4b06

Export makeBaseWrapper

9dc4dc6

Merge branch 'master' into ComposablePreprocOperators

a161c5d

Removing paramSetSugar

be84fae

Roxygenise

be896ec

Exporting changeData

99151fc

mb706 added 6 commits November 13, 2017 16:57

Merge branch 'master' into ComposablePreprocOperators

d280768

Removing the last few bits from before the cpo-mlr-split

46232a2

Export checkPredictLearnerOutput for mlrCPO

c560097

add 'keywords internal' to internal use only functions

03d2c3c

roxygenise

4ecc1dd

Merge branch 'master' into ComposablePreprocOperators

7044b17

mb706 force-pushed the ComposablePreprocOperators branch from 3660535 to 7044b17 Compare December 4, 2017 17:48

mb706 added 4 commits December 4, 2017 19:57

overlooked while merging

173b470

%%

48a75df

Merge branch 'master' into ComposablePreprocOperators

28d1bda

Merge branch 'master' into ComposablePreprocOperators

0d4a2a6

larskotthoff merged commit 0c0bbe7 into master Dec 13, 2017

larskotthoff deleted the ComposablePreprocOperators branch December 13, 2017 18:17
