drake relies on the R packages future and clustermq for all HPC, everything from local multicore processing to distributed computing on clusters. This allows drake to schedule and monitor targets at a high level without getting into the details of any one service in particular. If the k8s ecosystem has a resource manager (analogous to traditional schedulers like SLURM) that allows users to tunnel into workers, then it should be possible to build a layer of R code that sends jobs and receives in-memory data as output. With that in place, the next step is to wrap that layer into new backends for future and clustermq, and then drake will automatically be able to talk to k8s. I have recently been discussing new cloud backends with @HenrikBengtsson and Michael Schubert, and they might be open to this.
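To illustrate the division of labor, here is a minimal sketch of how drake hands scheduling off to a future backend. The plan contents are made up for illustration, but `drake_plan()`, `make(parallelism = "future")`, and `future::plan()` are real API; a hypothetical k8s backend would only need to supply a new `plan()` for this same code to run on a cluster.

```r
library(drake)
library(future)

# Pick a future backend. multisession = local parallel R processes;
# a k8s-aware backend would slot in here without changing anything below.
plan(multisession)

# A toy drake plan (illustrative targets only).
analysis_plan <- drake_plan(
  data    = mtcars,
  model   = lm(mpg ~ wt, data = data),
  summary = summary(model)
)

# drake resolves the DAG and dispatches targets through future.
make(analysis_plan, parallelism = "future", jobs = 2)
```

The point is that drake never needs to know what a worker is; it only asks future to run a target and hand back the result.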
Regarding kubeflow specifically, it depends on where it fits into the stack. If it operates as a top-level workflow automation tool with a DAG and everything, then it might be awkward to have it work in tandem with drake. But if kubeflow is more like SLURM or AWS Batch, then the direction I described above might be possible.
From my perspective, cloud integration is by far the greatest unmet need in R. Traditional HPC on private supercomputers is dying out, and data science is moving to AWS, Google Cloud, k8s, etc. R users need the simple ability to submit an in-memory job and get in-memory data back, with minimal config, minimal manual setup/teardown, and minimal cost. I believe this alone is worthy of several R Consortium grants.
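The "submit an in-memory job, get in-memory data back" round trip already exists locally via the future package; the unmet need is a cloud backend behind the same interface. A sketch of that round trip (local `multisession` stands in for a hypothetical k8s/cloud plan):

```r
library(future)
plan(multisession)  # swap in a cloud/k8s plan() here, once one exists

# The job runs in a separate R process (conceptually, a remote worker).
job <- future({
  sum(rnorm(1e6))
})

# The result returns as in-memory data: no manual file transfer,
# no setup/teardown scripts.
result <- value(job)
```

Everything k8s-specific would live inside `plan()`, which is what makes new backends such a leveraged investment.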
Short answer: unfortunately, a lot of infrastructure needs to be built up before drake can seamlessly interact with the k8s ecosystem.