CPO: Composable Preprocessing Operators #1827

Merged 246 commits on Dec 13, 2017

@mb706 mb706 commented Jun 5, 2017

This is my GSoC project. See the preliminary vignette for a quick overview (more compact version with the R output removed).

Description for a General Audience

@everyone. If you have questions, ideas or feedback, please don't hesitate to write to me, here or in other places!

What is this?

  • Functions for data manipulation and pre-processing
  • A replacement for makePreprocWrapper
  • Lots of syntactic sugar

Description

CPOs are called like functions and create an object with hyperparameters that can be manipulated using getHyperPars, setHyperPars, etc.

> cpoPca()
pca(center = TRUE, scale = TRUE)
> cpoPca(scale = FALSE)
pca(center = TRUE, scale = FALSE)
> cpo = cpoPca(scale = FALSE)
> getHyperPars(cpo)
$center
[1] TRUE

$scale
[1] FALSE

> setHyperPars(cpo, center = FALSE)
pca(center = FALSE, scale = FALSE)

These objects can be applied to Tasks or data.frames to manipulate data, or can be attached to a Learner to create a wrapped learner similar to makePreprocWrapper.

> # PCA-rotate pid.task
> rotated.pid.task = pid.task %>>% cpoPca()
>
> # rotate a data.frame
> rotated.attitude = attitude %>>% cpoPca()
> # use the same rotation matrix on a
> # shorter version of the DF
> short.rotated.attitude = head(attitude) %>>% retrafo(rotated.attitude)
> all.equal(head(rotated.attitude), short.rotated.attitude,
+   check.attributes = FALSE)
[1] TRUE
> 
> # Centering / Scaling *after* PCA
> neoPCA = cpoPca(center = FALSE, scale = FALSE, id = "pca") %>>% cpoScale()
> neoPCA
(pca.pca >> scale)(pca.center = FALSE, pca.scale = FALSE, center = TRUE, scale = TRUE)
>
> # Attach the above to learner
> pcaLogreg = neoPCA %>>% makeLearner("classif.logreg")
> getHyperPars(pcaLogreg)
$model
[1] FALSE

$pca.center
[1] FALSE

$pca.scale
[1] FALSE

$center
[1] TRUE

$scale
[1] TRUE

Custom CPO constructors can be created using makeCPOObject or makeCPOFunctional. Note that the (re)transformation operations can be written as braced expressions; the function header is added automatically.

> # bogus example, multiply first column
> cpomultiplier = makeCPOFunctional("multiplierF", factor = 1: numeric[~., ~.],
+   cpo.trafo = {  # implicit 'function(data, target, factor = 1)' here
+     data[[1]] = data[[1]] * factor
+     attr(data, "retrafo") = function(data) {
+       data[[1]] = data[[1]] / factor
+       data
+     }
+     data
+   })
> head(getTaskData(pid.task %>>% cpomultiplier(10000)))  # note first column
  pregnant glucose pressure triceps insulin mass pedigree age diabetes
1    60000     148       72      35       0 33.6    0.627  50      pos
2    10000      85       66      29       0 26.6    0.351  31      neg
3    80000     183       64       0       0 23.3    0.672  32      pos
4    10000      89       66      23      94 28.1    0.167  21      neg
5        0     137       40      35     168 43.1    2.288  33      pos
6    50000     116       74       0       0 25.6    0.201  30      neg
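
As a rough illustration of how a braced expression could be given a function header automatically, here is a minimal base-R sketch. The helper name makeFunFromBraces and its arguments are hypothetical and not part of mlr; the actual machinery behind makeCPOFunctional may well work differently.

```r
# Hypothetical helper (not part of mlr): wrap a braced expression in a
# function whose formals are supplied separately.
makeFunFromBraces = function(body.expr, args, env = parent.frame()) {
  f = function() NULL
  formals(f) = args       # e.g. alist(data = , target = , factor = 1)
  body(f) = body.expr     # the user-supplied { ... } block
  environment(f) = env    # free variables are looked up here
  f
}

# The braced block refers to 'data' and 'factor' without declaring them.
trafo.body = quote({
  data[[1]] = data[[1]] * factor
  data
})
trafo = makeFunFromBraces(trafo.body, alist(data = , target = , factor = 1))

# Multiply the first column by 10; the second column is left alone.
scaled = trafo(data.frame(x = 1:3, y = 4:6), factor = 10)
print(scaled)
```

Building the function from separate formals and body keeps the user-facing syntax short while still producing an ordinary R function that can be inspected or debugged.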

Implementation details

@berndbischl, @mllg

"CPO" (the name)

Would you want me to use "TaskTransform" instead of CPO (or something entirely different)?

makeParamSet Syntactic Sugar

I wrote a function paramSetSugar that makes creating ParamSets much less painful. Example:

> paramSetSugar(a: logical, b: integer[0, 10], c: numeric[, ]^2)
           Type len Def      Constr Req Tunable Trafo
a       logical   -   -           -   -    TRUE     -
b       integer   -   -     0 to 10   -    TRUE     -
c numericvector   2   - -Inf to Inf   -    TRUE     -

Do you like this idea in general (maybe you want to incorporate it into ParamHelpers?) or would you rather I not use this in my project?
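
For comparison, the sugar example above should correspond roughly to the following explicit ParamHelpers calls. This is a sketch based on the printed table (types, bounds and vector length), not on the paramSetSugar source, and it assumes the ParamHelpers package is installed.

```r
library(ParamHelpers)

# Explicit construction of the same three parameters:
#   a: logical, b: integer in [0, 10], c: unbounded numeric vector of length 2
ps = makeParamSet(
  makeLogicalParam("a"),
  makeIntegerParam("b", lower = 0, upper = 10),
  makeNumericVectorParam("c", len = 2)
)

print(ps)
```

The explicit form needs one constructor call per parameter, which is the verbosity the sugar is meant to remove.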

Object based vs. Functional CPOs

I implemented both, one in R/CPOObjectBased.R, the other in R/CPOFunctional.R; the code shared by both is mostly in R/CPOAuxiliary.R. Each has advantages and disadvantages. The object-based implementation could in theory use less memory, since its model does not carry around an environment that usually contains the training data. It is also easier to debug if you like to use debugonce. The functional implementation, in turn, can be applied directly to Task objects (since the CPO objects are just functions in this case) and could probably be coerced quite easily into working with the magrittr package.
Maybe have a look at my concrete implementations to see which one you like more.
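
The memory argument can be illustrated with a small base-R sketch. The function names below are hypothetical toy stand-ins, not the code in R/CPOFunctional.R or R/CPOObjectBased.R; they only show why a closure keeps the training data reachable while an explicit control object does not.

```r
# Functional style: the retrafo is a closure created during training.
trainFunctional = function(data) {
  center = colMeans(data)
  function(newdata) sweep(newdata, 2, center)
}

# Object-based style: training returns only an explicit control object.
trainObject = function(data) {
  list(center = colMeans(data))
}
retrafoObject = function(newdata, control) sweep(newdata, 2, control$center)

train = as.matrix(attitude)  # 'attitude' ships with R's datasets package
retrafo.fun = trainFunctional(train)
control = trainObject(train)

# The closure's environment still holds the training data ...
closure.vars = ls(environment(retrafo.fun))
print(closure.vars)
# ... while the control object holds only the column means.
print(names(control))

# Both variants apply the same centering to new data.
same = all.equal(retrafo.fun(head(train)), retrafoObject(head(train), control))
print(same)
```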

Note About Object Based Implementation

The makePreprocWrapper implementation in mlr relies on the transformation function returning an object list(data = [data], control = [control]). I had the idea of just having it return the resulting data, and using R magic to inspect the function's environment to get at the control. See e.g. the implementation of cpoScale:

cpoScale = makeCPOObject("scale", center = TRUE: logical, scale = TRUE: logical,
  cpo.trafo = {
    ## boilerplate :-( Hope to get rid of this at some point
    targetdata = data[target]
    data[target] = NULL
    ## here we go
    result = scale(as.matrix(data), center = center, scale = scale)
    data[] = result
    data[target] = targetdata
    ## the 'control' object will be retrieved by the CPO machinery
    ## and given to cpo.retrafo
    control = list(center = attr(result, "scaled:center"),
      scale = attr(result, "scaled:scale"))
    data
  }, cpo.retrafo = {
    ## here we have the 'control' object
    as.data.frame(scale(as.matrix(data),
      center = control$center, scale = control$scale))
  })
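
One way such "R magic" could work is to evaluate the trafo body in a fresh environment and read control back out afterwards. The following is a minimal base-R sketch under that assumption; runTrafo is a hypothetical name and this is not the actual CPO machinery.

```r
# Hypothetical runner (not mlr's code): evaluate a trafo body in a fresh
# environment and read 'control' back out of it afterwards.
runTrafo = function(trafo.body, data) {
  env = new.env(parent = globalenv())
  env$data = data
  result = eval(trafo.body, envir = env)
  if (!exists("control", envir = env, inherits = FALSE)) {
    stop("cpo.trafo did not create a 'control' object")
  }
  list(data = result, control = get("control", envir = env))
}

# The body only assigns 'control'; it returns the transformed data.
center.body = quote({
  control = list(center = colMeans(data))
  sweep(data, 2, control$center)
})

out = runTrafo(center.body, as.matrix(attitude))
print(out$control$center)
```

With this approach the user writes only the data transformation as the return value, and the framework is responsible for packaging data and control together.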

What is your opinion about this? Alternatives are:

  • Copying the entire cpo.trafo namespace to cpo.retrafo, so the user wouldn't need to worry about which variables are available and which are not. The downside: this would save the entire training data inside the model, which might be memory intensive.
  • I could also stop being fancy and just return list(data, control), as in makePreprocWrapper.
  • There is a way to inspect cpo.retrafo and copy only the objects that it uses, but this inspection is bound to be incomplete (the problem is equivalent to the halting problem) and could copy more data than the retrafo part needs.
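
To make the inspection alternative concrete, here is a small base-R sketch using all.vars on a retrafo body. The environment trafo.env is a hypothetical stand-in for whatever cpo.trafo leaves behind; the point is only that a purely syntactic scan can select more than the retrafo really needs.

```r
# The retrafo body from the cpoScale example, as an unevaluated expression.
retrafo.body = quote({
  as.data.frame(scale(as.matrix(data),
    center = control$center, scale = control$scale))
})

# Hypothetical stand-in for the environment left behind by cpo.trafo.
trafo.env = new.env()
trafo.env$data = attitude
trafo.env$result = scale(as.matrix(attitude))
trafo.env$control = list(
  center = attr(trafo.env$result, "scaled:center"),
  scale = attr(trafo.env$result, "scaled:scale"))

# Syntactic scan: which symbols does the retrafo body mention?
used = all.vars(retrafo.body)
# Copy candidates: mentioned symbols that exist in the trafo environment.
to.copy = intersect(used, ls(trafo.env))
print(to.copy)
```

Here 'data' is selected for copying even though the retrafo's 'data' is its own argument, so the training data would be stored unnecessarily. Telling the two apart requires more than a symbol scan.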

Composition operator

I chose %>>%, since it is similar to, but not used by, magrittr. It applies a CPO to a Task on its left and attaches a CPO to a Learner on its right, but Task %>>% CPO %>>% Learner does not work because of the associativity problem.
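
The associativity problem can be reproduced with a toy operator in base R. The classes toy_cpo and toy_learner below are hypothetical and much simpler than the real CPO classes. Since user-defined %op% operators in R are left-associative, a %>>% b %>>% c is always evaluated as (a %>>% b) %>>% c.

```r
# Toy constructor: a "CPO" is just a list of data.frame -> data.frame steps.
toyCpo = function(f) structure(list(steps = list(f)), class = "toy_cpo")

`%>>%` = function(lhs, rhs) {
  if (inherits(lhs, "toy_cpo") && inherits(rhs, "toy_cpo")) {
    # CPO %>>% CPO: compose
    structure(list(steps = c(lhs$steps, rhs$steps)), class = "toy_cpo")
  } else if (inherits(lhs, "toy_cpo") && inherits(rhs, "toy_learner")) {
    # CPO %>>% Learner: attach
    structure(list(cpo = lhs, learner = rhs),
      class = c("toy_wrapped_learner", "toy_learner"))
  } else if (is.data.frame(lhs) && inherits(rhs, "toy_cpo")) {
    # data %>>% CPO: apply each step in order
    for (step in rhs$steps) lhs = step(lhs)
    lhs
  } else {
    stop("unsupported operands")
  }
}

doubler = toyCpo(function(d) { d[[1]] = d[[1]] * 2; d })
learner = structure(list(name = "toy"), class = "toy_learner")
df = data.frame(x = 1:3)

wrapped = doubler %>>% learner        # fine: CPO attached to learner
twice = df %>>% doubler %>>% doubler  # fine: (df %>>% CPO) %>>% CPO
# Parsed as (df %>>% doubler) %>>% learner, i.e. data %>>% learner:
res = tryCatch(df %>>% doubler %>>% learner,
  error = function(e) conditionMessage(e))
print(res)
```

Writing df %>>% (doubler %>>% learner) would not help either in this toy version, because applying data to a learner is not defined; the sketch only shows why the three-part chain cannot mean "attach, then train" under left-to-right evaluation.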

State of implementation

The current roadmap, as I see it; comments?

  • makeParamSet syntactic sugar (R/ParamSetSugar.R)
  • Object based CPO (R/CPOObjectBased.R)
  • Functional CPO (R/CPOFunctional.R)
  • CPO composition
  • CPO attachment to Learner
  • CPO application to Task / data.frame
  • Easily do re-trafo on data after training on other data (i.e. perform the predict step without being attached to a Learner)
  • Basic documentation (needs polish)
  • Base functionality tests
  • Small example CPOs (R/CPO_concrete.R)
  • nicer API to avoid boilerplate code that is the same in most CPOs
  • properties handling
  • CPO multiplexer
  • cpoCbind
  • metaCPO
  • list all built-in CPOs
  • select features to affect / ignore
  • Actually useful CPOs
  • Testing concrete CPOs
  • export / fix hyperparameters
  • cpoCbind avoid copying redundant columns
  • target CPO

Maybe some day...

  • Task weights
  • Task blocking

@lintr-bot

R/measures.R:1435:73: style: Use FALSE instead of the symbol F.

perror = pec(probs, f, data = newdata[, tn], times = grid, exact = F, exactness = 99L,
                                                                       ~^

@larskotthoff
Sponsor Member

What's the status here? Does this still need to be merged for mlrCPO to work?

@mb706
Contributor Author

mb706 commented Dec 13, 2017

Yes please, mlrCPO relies on a few kinda-internal functions of mlr. Should be pretty stable as well (my last few commits were just running after merge conflicts).

@larskotthoff
Sponsor Member

Thanks, merging.

@larskotthoff larskotthoff merged commit 0c0bbe7 into master Dec 13, 2017
@larskotthoff larskotthoff deleted the ComposablePreprocOperators branch December 13, 2017 18:17
zmjones pushed a commit that referenced this pull request Dec 19, 2017
* Introducing Composable Preprocessing Objects.

* ParamSet syntactic sugar

* Make git ignore emacs temp files

* Bugfixes in ParamSetSugar

* Automatically generate function from braced expressions

* lintr fixes

* Creation of CPOObject

* Nice printing

* CPOObject concatenation

* CPO composition

* Composition operator also for attachment

* wrapping seems to work now

* some experiments

* Some reorganizing

* Implemented CPOFunctional, against all odds. Probably full of bugs still.

* Reorg: Organize CPOObjectBased, CPOFunctional the same

* Bugfixes

* CPOObject now doesn't need to return 'control', just create it.

* lintr

* ParamSetSugar test

* Indentation

* Testing most of CPO, excluding ParamSet feasibility checks

* Bugfixes, found through tests

* setHyperPars: assert uniquely named parameters

* Testing hyperparameter feasibility

* Test parameter feasibility

* Testing actual data transformation

* Testing CPO trafo functions

* Testing requirement handling

* Forgive absence of parameters with unfulfilled requirements

* Requirement handling when changing ID

* Repair global var problems in S3 methods in cpo tests

* lintr

* Application operator

* Corrected copy-paste caused typo

* Inform user when he forgets to construct CPO

* Documentation

* Make R CMD check --no-test happy

* lintr doesnt recognize CPO function definitions as functions

* Retrafo set / access functions

* paramSetSugar parameter pss.* now have dot prefix
for R param matching reasons

* retrafo() machinery

* Functional CPO now uses retrafo()

* Roxygenise

* Tests work again

* Bugfixes

* More informative error messages

* More informative error messages

* Testing for error handling

* Embarrassing!

* static analyzer safe paramSetSugar

* Using NA instead of dot to indicate missing parameter

* Cleaning up documentation

* Documentation fixes

* lintr

* Turn chain of preprocs into list, and assemble list into chain

* roxygenize

* Put common CPO test objects into helper_cpo.R

* Refactor chainung and un-chaining

* Chaining, unchaining of object based retrafos

* use 'predict' to apply retrafos

* lint

* Adding get / set retrafo state functionality

* Adding get state and makefromstate for object based

* Cleaning up CPO object based

* Testing for retrafo state

* Cleaning up CPOFunctional

* Adding get state and makefromstate for functional based

* lintr

* R CMD check

* Small test correction
Evidently I should clear my .GlobalEnv before running tests.

* small comment change

* retrafo assignment now checks for type, not function

* Adding properties parameters

* Added docu, todo

* Starting task shape verification things

* Get format check its own file

* Auxiliary files reorg

* changed object based callCPO[Re]Trafo, need to propagate the changes now

* One more step towards properties & data shape checking

* Tests pass again

* cleaning up

* Make lint approve of TODOs temporarily

* Tests first half of target type functionality

* checkLearnerBeforeTrain: Wrong error message when unordered not supported

* lintr

* NOOP

* Finished datasplit tests

* Most property tests are done

* get CPO from learner

* Ported properties and datasplit to functional

* Tests pass

* roxygenise

* Make tests faster

* travis timeout ++

* Travis timeout +++

* Rewritten CPO core. A beauty to behold!
This removes 'makeCPOFunctional' and 'makeCPOObject' and replaces
them both with 'makeCPO'.

* Travis timeout ++++

* Added 'factor', 'ordered', 'onlyfactor', 'numeric' datasplit
Numeric splits also support matrix instead of data frame

* Introducing NULLCPO, the neutral element of the CPO monad

* Starting targetbound CPO

* Targetbound CPO Task conversion backend

* static code analyser found bugs

* Making big steps towards target CPOs

* to-do list, travis timeout ++ again

* Pretty much done with target-bound CPO

* Introducing stateless CPOs

* is.nullcpo

* Completing stateless

* Roxygenise

* Tests pass

* lintr

* stateless trafo-less CPO

* ShapeInfo printing

* Nicer ShapeInfo printing

* More generics for getting CPO information

* Multiplexer, Applicator

* Renaming test files

* Split up test_cpo_datasplit into *_datasplit and *_properties

* Checking par.vals availability at the right places

* Proper datasplit numeric / factor / etc handling

* New tests

* Datasplit numeric, factor, ordered, onlyfactor finally seem to work

* Finished datasplit tests

* summary bug

* example CPOs handle DFs containing non-numeric columns

* repair summary

* Accept NA vector length

* Accept character vectors for discrete character params

* cpoSelect CPO

* cpoSelect Params Reorg

* cpoCbind

* Check more rigorously that CPOs don't get called too often.

* Test cpoCbind with tasks

* listCPO

* Testing concrete CPOs so far

* Fix retrafo column name test

* fix.factors

* dummy encoder

* Column selection by name

* invert option for cpoSelect

* Starting to implement affect.*

* Interpreting subset

* Fixing some bugs, implementing some tests, for affect subset

* Don't print meta-params for CPO constructors

* Tests for affect.* done

* Collect meta-CPOs in a separate file

* Finishing cpoMeta and its tests

* lintr

* Export cpoMeta

* Fix summary bug

* Export some functions I forgot to export

* Adding jupyter vignette

* Adding html rendered version of vignette

* update .gitignore

* Compact html vignette

* Fixing test bug for impute

* Impute CPO + tests

* Adding specialised CPO imputers

* CPO Imputers get their own file

* lintr

* Test that dummys are not created when the flag says so.

* Updating Vignette

* checkMeasures: instead of missing(), use NULL

* Feature Filters

* Introducing applyCPO: apply a CPO to a Task / df

* Introduce composeCPO: composing two CPOs

* Introducing attachCPO: Attaching a CPO to a learner.

* Adjust properties of imputers that can only handle certain types

* Forgot export

* filter features now only operate on the columns of the right type

* Constant Feature Remover CPO

* CPO for fixing factors

* cpoDummyEncode works much better now

* MissingIndicators CPO

* Repairing checkMeasures

* Better travis check

* cpoCbind bugfixes

* bugfix

* Vignette updates

* use cases ipynb

* Recursive application of CPO fix

* roxygenise

* Avoid warnings when load_all-ing mlr

* Now possible to specify packages associated with CPOs

* Bugfix

* Fix blackboost bug
TODO: report this

* Revert "Avoid warnings when load_all-ing mlr"

This reverts commit ba2cf73.

Reverting this because I made an extra pull-request

* Revert "Fix blackboost bug"

This reverts commit 9e1007f.

Reverting this b/c I made an extra PR

* Fix imputation of empty df bug

* cpoScale fix for 1-column data

* cpoApplyFun

* cpoRangeScale

* cpoProbEncode

* Impact encoding

* Adding new CPOs

* rename cpoRangeScale -> cpoScaleRange

* More natural handling of 'stateless' cpo

* .retrafo.format added, with new 'combined' option

* optionally only export subset of parameters

* fix export

* removing 'CPOS3Primitive' class, as ordered

* removing 'CPOS3Constructor' class, as ordered

* removing 'CPOS3RetrafoPrimitive', as ordered

* simplifying class structure, as ordered

* simplifying class structure further

* done simplifying class structure

* makeCPO documentation

* Handle ID correctly for non-exported values

* Pretty printing; Bugfixes

* A few new CPOs

* cpoSpatialSign

* tuning CPO test

* tuning CPO test II

* bugfixes

* roxygenise

* Vignette: Examples, CPOs, Construction

* cleaning up vignette

* vignette html export

* Bugfix

* Tuning vignette

* vignette

* exporting necessary things

* deleting superfluous files which are now in mlrCPO

* forgot necessary function

* Export makeBaseWrapper

* Removing paramSetSugar

* Roxygenise

* Exporting changeData

* Removing the last few bits from before the cpo-mlr-split

* Export checkPredictLearnerOutput for mlrCPO

* add 'keywords internal' to internal use only functions

* roxygenise

* overlooked while merging

* %%

7 participants