Machine Learning framework package idea (machineRy)

Hello,

I’ve spent the last year getting well-acquainted with R and its machine learning libraries by developing a couple of projects of personal interest. As I’ve done so, I have been impressed with how well S3 works to allow different machine learning libraries to be used with a common set of functions (like predict()).
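
For instance (a minimal sketch, assuming the e1071 package is installed), the same generic dispatches to whichever method matches the model’s class:

```r
library(e1071)

df <- data.frame(x = runif(100))
df$y <- sin(2 * pi * df$x) + rnorm(100, sd = 0.1)

fit_lm  <- lm(y ~ x, data = df)    # an object of class "lm"
fit_svm <- svm(y ~ x, data = df)   # an object of class "svm" (e1071)

# One generic, two methods: predict() dispatches on the object's class
head(predict(fit_lm,  newdata = df))   # calls predict.lm
head(predict(fit_svm, newdata = df))   # calls predict.svm
```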

As I’ve developed my own algorithms and code, I started to build a framework that makes it easier for me to prepare a data set, build a machine, train, test, and visualize the results. Of course, my use cases are naturally limited - I’m focused mostly on regression using support vector machines, random forests, glm, etc. But the framework, as I’ve developed it, seems like it might be more broadly useful. I’m about a third of the way through early development, but I’ve named it “machineRy” and have built a skeleton package.

The framework, as I’ve developed it, is broken into three key components: data, machine, and plot.

Each component represents a key element of the machine learning process: preparing the data, machine training and prediction, and managing graphical plots / text output.

The advantage of using a framework is that it “canonizes” each component, making it easy to customize each step while ensuring consistent behavior. For example, data preparation may involve transforming variables with non-Gaussian distributions, scaling / centering, expanding factor variables into 1-of-n encodings, etc. Any (or all) of these steps could be easily implemented or omitted while preserving correct order (e.g., scaling / centering should follow variable transformation). Additionally, user-defined functions could be substituted into any of these steps (e.g., implementing a Box-Cox transform rather than the default log transform), or additional user-defined functions could be inserted into the overall process.
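
To make that concrete, here is a rough sketch of what I have in mind for the data component (none of these names exist in a package yet; they are purely illustrative):

```r
# Hypothetical sketch only: an ordered list of interchangeable prep steps.
prep_data <- function(df, steps) {
  for (step in steps) df <- step(df)
  df
}

default_steps <- list(
  # transform non-Gaussian variables (a Box-Cox step could be swapped in here)
  transform = function(df) {
    i <- sapply(df, is.numeric)
    df[i] <- lapply(df[i], log1p)
    df
  },
  # centering / scaling always follows the transformation step
  scale = function(df) {
    i <- sapply(df, is.numeric)
    df[i] <- lapply(df[i], function(x) as.numeric(scale(x)))
    df
  },
  # expand factor variables into 1-of-n encodings
  encode = function(df) as.data.frame(model.matrix(~ . - 1, data = df))
)

prepped <- prep_data(iris, default_steps)
head(prepped)
```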

Likewise, the machine component would have the same modularity, allowing potentially incompatible machines to be mixed and matched to create ensembles. In my case, I’ve built an ensemble consisting of a couple of machines from the H2O R library, the svm model from the e1071 package, glm, etc. But implementing an ensemble (like a stacked generalizer) with disparate machines such as these requires some work to ensure common output, as well as the manipulation needed as results are passed between machines. Working out these sorts of issues is primarily what I’m interested in addressing with this framework.
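
As a sketch of the machine component (again hypothetical; make_machine() is not from any existing package), each machine would be reduced to a fit/predict pair that returns plain numeric predictions, so disparate packages can be combined:

```r
library(e1071)

make_machine <- function(fit_fun, pred_fun) list(fit = fit_fun, predict = pred_fun)

svm_machine <- make_machine(
  fit_fun  = function(x, y) svm(x = x, y = y),
  pred_fun = function(fit, newx) as.numeric(predict(fit, newdata = newx))
)

glm_machine <- make_machine(
  fit_fun  = function(x, y) glm(y ~ ., data = cbind(x, y = y), family = gaussian()),
  pred_fun = function(fit, newx) as.numeric(predict(fit, newdata = newx))
)

machines <- list(svm = svm_machine, glm = glm_machine)
x <- mtcars[, -1]
y <- mtcars$mpg

fits  <- lapply(machines, function(m) m$fit(x, y))
preds <- sapply(names(machines), function(n) machines[[n]]$predict(fits[[n]], x))
head(rowMeans(preds))  # a trivial "ensemble": a simple average of the base machines
```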

I’ll stop with that. I fully expect to implement this framework because my code is otherwise becoming unmanageable and it’s getting difficult to build and test new machines as I try to improve model performance. That said, I thought that it might be of use to others, but before I presume as much, I felt it’d be worth vetting the idea and really exploring whether my needs represent those of a greater whole or are just the product of my niche of research…

Hey there,
Machine learning in R and particularly ensemble learning via stacking in R is something I have spent a lot of time working on. There are a few packages that already do some or all of what you are describing that you may want to check out.

For general packages that wrap a bunch of ML algorithms and also include pre-processing components, the two that stand out are caret and mlr. Caret is very well developed (and actively maintained) and has been around for a long time. mlr is newer and implements a different type of API compared to other R ML packages, but it’s intuitive. I think it looks very promising and is also actively developed.
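
To give a flavor of each API, here is a rough sketch from memory (check the docs for current syntax; it assumes the kernlab and e1071 packages are installed for the SVM learners):

```r
library(caret)
fit_caret <- caret::train(mpg ~ ., data = mtcars, method = "svmRadial",
                          preProcess = c("center", "scale"),
                          trControl = trainControl(method = "cv", number = 5))

library(mlr)
task    <- makeRegrTask(data = mtcars, target = "mpg")
lrn     <- makeLearner("regr.svm")
fit_mlr <- mlr::train(lrn, task)
pred    <- predict(fit_mlr, task)
```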

As for model stacking, there are several packages I can recommend:

  • SuperLearner: The original stacked ensemble package in R, created in 2010. This package implements a general framework for stacking models in R via linear combinations (stacking is also referred to as the Super Learner algorithm at Berkeley and elsewhere). You can use one of the 30+ existing algorithms or write your own wrapper functions. If you have reasonably small data, this package works well. I am one of the maintainers of this package (there’s a quick usage sketch just after this list).
  • subsemble: I wrote this package. It implements stacking via the traditional Super Learner algorithm and also implements the Subsemble algorithm, which is a variant of stacking that trains on subsets/partitions of the data, which can allow for a big speed up in training. The way the parallelism is implemented in subsemble is (I think) more efficient than how it’s implemented in the SuperLearner package, which is part of the reason I made this a separate package from SuperLearner. It also supports using any function as the metalearner, instead of just using linear combinations.
  • h2oEnsemble: I wrote this package as a way to scale ensemble learning in R beyond what is possible with single-threaded algorithms, like those used in SuperLearner and subsemble. Using H2O for the modeling speeds up training immensely and also allows for very large training sets.
  • caretEnsemble: Check out the caretStack function. If your models are caret models, then this is the obvious choice. However, the last time I checked, this package was using out-of-bag predictions rather than cross-validated predictions to train the metalearner. This results in a speed-up in training, but probably less accurate ensemble models. The other limitation is that you can only create linear combinations of learners.
  • mlr: Check out the makeStackedLearner function. I believe that this implements the Super Learner algorithm true to form.
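
Here is the quick SuperLearner sketch I mentioned above (regression on mtcars with two of the built-in wrappers; it assumes the randomForest package is installed):

```r
library(SuperLearner)

y <- mtcars$mpg
x <- mtcars[, -1]

sl <- SuperLearner(Y = y, X = x, family = gaussian(),
                   SL.library = c("SL.glm", "SL.randomForest"))
sl  # prints the CV risk and the weight given to each base learner

pred <- predict(sl, newdata = x)
head(pred$pred)
```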

Let me know if you are interested in using or contributing to any of these packages! I am curious to hear whether they fit your use case.

Best,
Erin

Thanks for the response. If I can avoid reinventing the wheel, I’d certainly prefer to do that. Just glancing at your descriptions, SuperLearner / subsemble is the sort of thing I am trying to produce, though I think I had more of subsemble’s goal in mind.

I wanted to create a framework that wraps existing ML packages and allows them to be used with each other in stacked / ensemble ways (to repeat myself). I like H2O, but I’ve found it unstable on Windows, so I use it sparingly and have often combined H2O machines with more vanilla machines (like e1071::svm). One of my secondary goals was to provide some minimum level of multithreaded support for any given machine (e.g., using foreach %dopar% in k-fold applications). Really, though, the goal was just to create a framework that standardizes machine input and output and provides a consistent, customizable way to prepare and visualize data, so I don’t have to reinvent these support functions every time I try to implement a different machine architecture.
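
The sort of thing I mean on the parallel side (just a sketch): wrap the per-fold fitting in foreach with a doParallel backend, which also works on Windows:

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

k <- 5
fold_id <- sample(rep(1:k, length.out = nrow(mtcars)))

cv_pred <- foreach(i = 1:k, .combine = rbind, .packages = "e1071") %dopar% {
  fit  <- e1071::svm(mpg ~ ., data = mtcars[fold_id != i, ])
  test <- mtcars[fold_id == i, ]
  data.frame(row = which(fold_id == i),
             pred = as.numeric(predict(fit, newdata = test)))
}

stopCluster(cl)
head(cv_pred[order(cv_pred$row), ])  # out-of-fold predictions in original row order
```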

I’ll certainly take a closer look at what you’ve suggested.

Thanks.

Hi again,
Yeah, it sounds like SuperLearner or subsemble would suit your needs. It’s certainly possible to use H2O models in SuperLearner/subsemble, so if you want the flexibility of being able to use any algorithm in your ensemble, then either of those packages would be good. SuperLearner and subsemble use the same algorithm wrappers (which provide the unified interface you are looking for), so you can try out both packages easily. There are existing wrappers for svm, glm, etc., that you can use out of the box, and it’s pretty easy to create custom wrappers.
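
For example, a custom wrapper is just a small function plus a predict method. Here is a rough template modeled on the wrappers that ship with SuperLearner (the names SL.mysvm and predict.SL.mysvm are mine, not part of the package; double-check the current wrapper conventions in the package docs):

```r
SL.mysvm <- function(Y, X, newX, family, ...) {
  fit.svm <- e1071::svm(x = X, y = Y)
  pred <- as.numeric(predict(fit.svm, newdata = newX))
  fit <- list(object = fit.svm)
  class(fit) <- "SL.mysvm"
  list(pred = pred, fit = fit)
}

predict.SL.mysvm <- function(object, newdata, ...) {
  as.numeric(predict(object$object, newdata = newdata))
}

# Then include it alongside the built-in wrappers:
# SuperLearner(Y = y, X = x, SL.library = c("SL.glm", "SL.mysvm"))
```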

I am interested to get your feedback on subsemble in particular if you try it out (especially the parallel training functionality on Windows). The README is the best place to start. There are a few things I think can be improved related to memory use / copies of data, and I’m very open to experimenting with new functionality – especially different parallel backends. If you want to read more about my experiments with these packages, there is a lot of info in my thesis, available here.

Sorry to hear you are having issues with H2O stability on Windows – is that with certain algorithms, or with the software in general? I work at H2O.ai (the company), so if you want to provide more info about the issues you’ve been having, I’ll make sure you get the support you need. This is our open source support group, where you can get help with any issues you may be having: https://groups.google.com/forum/#!forum/h2ostream. We are rapidly developing H2O, so if you haven’t tried it out lately, some or all of the issues you were having may already be addressed.

Feel free to reach out at oss at ledell.org.

Best,
Erin

Erin,

I’m not sure what, exactly, the problem with H2O is, but if I set up a cluster on my local machine, it has a tendency to just shut down randomly. It doesn’t happen frequently, and at first I suspected it had to do with Windows locking my machine if I let it sit idle long enough, but I’ve had a cluster shut down on me while I was actively using my machine, so I doubt that’s the cause.

In any case, I work in a somewhat constrained environment, technologically speaking. I tried setting up an EC2 cluster to see if I could work around it, but I’m running into network issues that I doubt the management would be willing to resolve for me. What I wouldn’t give for a Linux machine around here… :slightly_smiling:

Apart from that, H2O’s been a pretty nice tool to use. The R API has been a lifesaver.

I’ll definitely take some time to get a feel for SuperLearner / subsemble.

Thanks again.

Joel

Have you considered Docker?
