Hello,
I’ve spent the last year getting well acquainted with R and its machine learning libraries by developing a couple of projects of personal interest. As I’ve done so, I have been impressed with how well S3 allows different machine learning libraries to be used through a common set of functions (like predict()).
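As a minimal sketch of that S3 point, here is a toy model class that plugs into the same predict() generic as lm. The mean_model class and its constructor are hypothetical, made up purely for illustration; only lm(), predict(), and the built-in mtcars data come from base R.

```r
# Hypothetical toy model: predicts the training mean for every new row.
mean_model <- function(y) {
  structure(list(mu = mean(y)), class = "mean_model")
}

# S3 method: predict() dispatches here for objects of class "mean_model".
predict.mean_model <- function(object, newdata, ...) {
  rep(object$mu, nrow(newdata))
}

fit_lm   <- lm(mpg ~ wt, data = mtcars)
fit_mean <- mean_model(mtcars$mpg)
new_rows <- mtcars[1:3, ]

# The same generic call works for both models, whatever their class.
preds <- lapply(list(lm = fit_lm, mean = fit_mean), predict, newdata = new_rows)
```

The calling code never needs to know which library a model came from, as long as a predict method exists for its class.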
As I’ve developed my own algorithms and code, I started to build a framework that makes it easier for me to prepare a data set, build a machine, train, test, and visualize the results. Of course, my use cases are naturally limited: I’m focused mostly on regression using support vector machines, random forests, glm, etc. But the framework, as I’ve developed it, seems like it might be more broadly useful. I’m about a third of the way through early development, but I’ve named it “machineRy” and have built a skeleton package.
The framework, as I’ve developed it, is broken into three key components: data, machine, and plot.
Each component represents a key element of the machine learning process: preparing the data, machine training and prediction, and managing graphical plots / text output.
The advantage of using a framework is that it “canonizes” each component, making it easy to customize each step while ensuring consistent behavior. For example, the data preparation may involve transforming variables with non-Gaussian distributions, scaling / centering, expanding factor variables into 1-of-n encodings, etc. Any (or all) of these steps could be easily implemented or omitted while preserving the correct order (e.g., scaling / centering should follow variable transformation). Additionally, user-defined functions could be substituted into any of these steps (e.g., implementing a Box-Cox transform rather than the default log transform), or additional user-defined functions could be inserted into the overall process.
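One way such an ordered, swappable data step could look is sketched below. This is not machineRy code: default_steps, map_numeric, and prepare_data are hypothetical names, and a square-root transform stands in for a user-defined replacement such as Box-Cox.

```r
# Hypothetical helper: apply f to every numeric column of df.
map_numeric <- function(df, f) {
  num <- vapply(df, is.numeric, logical(1))
  df[num] <- lapply(df[num], f)
  df
}

# Hypothetical default steps, stored in the order they must run.
default_steps <- list(
  transform = function(df) map_numeric(df, log),
  scale     = function(df) map_numeric(df, function(x) as.numeric(scale(x))),
  encode    = function(df) {
    # Expand each factor into one 0/1 column per level (1-of-n encoding).
    parts <- lapply(names(df), function(nm) {
      x <- df[[nm]]
      if (!is.factor(x)) return(df[nm])
      m <- outer(as.character(x), levels(x), "==") * 1
      colnames(m) <- paste0(nm, "_", levels(x))
      as.data.frame(m)
    })
    do.call(cbind, parts)
  }
)

# Run the steps in list order.
prepare_data <- function(df, steps = default_steps) {
  for (step in steps) df <- step(df)
  df
}

prepped <- prepare_data(iris)

# Swap in a user-defined transform; modifyList() keeps the original order,
# so scaling still runs after the transform.
custom_steps <- modifyList(
  default_steps,
  list(transform = function(df) map_numeric(df, sqrt))
)
prepped_custom <- prepare_data(iris, custom_steps)
```

Keeping the steps in a named list means a step can be dropped (for example `default_steps[c("transform", "encode")]`) or replaced by name without the caller having to restate the ordering.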
Likewise, the machine component would have the same modularity, allowing potentially incompatible machines to be mixed and matched to create ensembles. In my case, I’ve built an ensemble consisting of a couple of machines from the H2O R library, the svm model from the e1071 package, glm, etc. But implementing an ensemble (like a stacked generalizer) with disparate machines such as these requires some work to ensure common output, as well as the necessary manipulation of results as they are passed between machines. Working out these sorts of issues is primarily what I’m interested in addressing with this framework.
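To make the "common output" problem concrete, here is a rough sketch of an adapter plus a simple holdout-based stack, using only base R models so it runs without H2O or e1071. The names make_machine, train, and predict_ensemble are hypothetical, and the single holdout split is a simplification of a proper cross-validated stacked generalizer.

```r
# Hypothetical adapter: pairs a fitting function with a predict function.
make_machine <- function(fit_fun, predict_fun) {
  structure(list(fit_fun = fit_fun, predict_fun = predict_fun, model = NULL),
            class = "machine")
}

train <- function(m, data) {
  m$model <- m$fit_fun(data)
  m
}

# Common output: every machine returns a plain numeric vector.
predict.machine <- function(object, newdata, ...) {
  as.numeric(object$predict_fun(object$model, newdata))
}

machines <- list(
  linear = make_machine(
    function(d) lm(mpg ~ wt, data = d),
    function(fit, nd) predict(fit, newdata = nd)
  ),
  # predict.glm defaults to the link scale, so the adapter asks for "response".
  gamma_glm = make_machine(
    function(d) glm(mpg ~ hp, family = Gamma(link = "log"), data = d),
    function(fit, nd) predict(fit, newdata = nd, type = "response")
  )
)

set.seed(1)
idx  <- sample(nrow(mtcars), 20)
base <- mtcars[idx, ]    # rows used to train the base machines
hold <- mtcars[-idx, ]   # rows used to train the meta-learner

fitted_machines <- lapply(machines, train, data = base)

# Level-1 data: one column of holdout predictions per machine.
level1 <- as.data.frame(lapply(fitted_machines, predict, newdata = hold))
level1$mpg <- hold$mpg
meta <- lm(mpg ~ linear + gamma_glm, data = level1)

predict_ensemble <- function(newdata) {
  level1_new <- as.data.frame(lapply(fitted_machines, predict, newdata = newdata))
  as.numeric(predict(meta, newdata = level1_new))
}

ens <- predict_ensemble(mtcars[1:5, ])
```

The per-machine predict_fun is where library-specific quirks (link vs. response scale, matrix vs. vector output, special data containers) would be absorbed, so the stacking code only ever sees numeric vectors.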
I’ll stop with that. I fully expect to implement this framework because my code is otherwise becoming unmanageable and it’s getting difficult to build and test new machines as I try to improve model performance. That said, I thought that it might be of use to others, but before I presume as much, I felt it’d be worth vetting the idea and really exploring whether my needs represent those of a greater whole or are just the product of my niche of research…