Docker and general scientific computing set-ups


#1

@cboettig and I have been meaning for a while to have a conversation about computing set-ups and reproducibility, as we’re both starting new gigs where we have the opportunity to re-think how we organize our scientific computing workflows. We figured we might as well have it here so others can chime in.

To start, I have a bunch of questions about Docker. It seems to me it’s a tool you could make central to your computing set-up, for the purposes of having reproducible projects, maintaining a general computing setup across machines, and quickly firing up cloud assets when you need them. Here goes:

  • Let’s say I have a Dockerfile that describes my general computing setup, meaning that it has a lot more stuff than I need for one project: It has R/RStudio, Anaconda/Jupyter, node.js, a bunch of assorted command line and connectivity tools, custom pandoc templates, dotfile configurations, and files with API keys, etc. Each custom part might be in GitHub repos, but in general I want this setup ready to go, and quickly portable in case I decide I want to move to a more powerful machine, move to work offline, my hardware fails or I lock myself out of my office. What should be my workflow for keeping it updated? How do I store the image so I can quickly fire it up on either my laptop, my powerful headless server, or a cloud asset?

  • For a project, I might have a Dockerfile that fires up the minimal setup for that analysis. If my project is, say, organized as an R package on Github, how and where should I store its associated Dockerfile? What should be in the Dockerfile? (For instance, should packages be there, or in something like Packrat?). Should I strive to use generic docker images available elsewhere?

  • I think of my general work interface into a working Machine as a terminal via SSH, plus a browser with RStudio or Jupyter notebook. If the machine is a Docker image, local or remote, would this be any different?

  • There’s another workflow in which the whole Docker image is fired up, runs some simulation or analysis, and returns the results and shuts down. When do you use this, how does it work?

  • What research-related content doesn’t live inside this ecosystem? Like, there’s no need to keep your PDFs in all this.

  • Most collaborators aren’t going to know anything about this stuff. Does this type of approach introduce any new friction when working with people with a less computational bent? And oh yeah, data. I’ll be doing a lot more work where I need to share data on others’ terms. Where do you keep data in all this?

Other little things:

  • Should I use Vagrant on my OSX laptop so I can have Ubuntu as my common Docker OS rather than Tiny Core OS?
  • How do you use Docker with, or in lieu of, Travis-CI and related services and tasks?

#2

Hi Noam, folks,

Noam, thanks for kicking this off, I think you raise a bunch of good
questions here. Hope some are of interest to others as well; I think folks
should chime in with other opinions, with how these could be addressed by
other tools, or just to point out where something needs more clarification
as to what the heck we’re talking about. Also a disclaimer that there are
lots of different ways to use docker in general, which may be a strength but
also leads to a lot of confusion about what it is and what it’s good for.
Others may use it in very different ways or contexts, but I think it’s a
good discussion to have.


#3

Let’s say I have a Dockerfile that describes my general computing setup,
meaning that it has a lot more stuff than I need for one project: It has
R/RStudio, Anaconda/Jupyter, node.js, a bunch of assorted command line
and connectivity tools, custom pandoc templates, dotfile configurations,
and files with API keys, etc. Each custom part might be in GitHub
repos, but in general I want this setup ready to go, and quickly
portable in case I decide I want to move to a more powerful machine,
move to work offline, my hardware fails or I lock myself out of my office.
What should be my workflow for keeping it updated? How do I store the
image so I can quickly fire it up on either my laptop, my powerful
headless server, or a cloud asset?

Hosting the image on Docker Hub is probably the best way to go. You get one
private image slot free; though it’s pretty easy to set up your own
’docker registry’ (which is open source, like docker itself) on your own
server. This is pretty analogous to putting git on a server – the
open source registry software has no web front end, but it makes it easy
to host as many private images as you like (or can fit on a server).
Transferring via the Hub or a private docker registry is almost always
faster and easier than moving the image around as a tarball or anything
else.

My suggestion is to build the image based on
existing images maintained by someone else. I find it easier to have a
few different images for different major software (e.g. an R/RStudio
image based on the rocker project and a separate Python/Jupyter image
based on their official images), but that’s a matter of taste, and might
be more of a nuisance if you want the same private data to be available on
each image. Most of my images are primarily software, and the data I
pull from git/github, or by linking local volumes that have data. Your
kitchen sink approach has the elegance of simplicity over my approach
– since pulling from git (using credentials passed as env vars) is an
extra step, and linking local volumes is obviously less portable. I’m
still deciding what I like best myself. One advantage of my approach is
it has a bit more clean isolation between the image needed to provide
the software environment which you might want to share with others,
while keeping any private data securely off the image. I’ve also had
success linking ‘volume containers’ to provide the private data part
while still being portable.

The best way to keep an image updated, I find, is to always use it, whether working locally or
not. How to “keep it updated” is really quite a few different topics.
The first bifurcation point here is regarding docker commit.

Using docker commit is probably the most intuitive way to go. You start with some
prebuilt image, install your software, add/generate your data, and then
use docker commit any time you want to save/move your data/software,
as you would with git commit. You can give a commit message, roll
back commits, etc. You run docker push and docker pull to move the
image up to the hub and pull it down on other computers. You update
installed software or add new software just as you would on any linux
server (e.g. for debian-based containers: apt-get update, apt-get
upgrade, apt-get install, etc). This is probably the way to go,
particularly with the kitchen-sink container that has your own data on
it. In this approach, you actually never have to write a Dockerfile
either.
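For anyone who wants to see the commit-based workflow spelled out, here is a minimal sketch. The container name (workbench) and the Docker Hub repository (username/workbench) are hypothetical placeholders, and the script only prints the commands unless you set DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Sketch of the commit / push / pull cycle described above.
# With DRY_RUN=1 (the default) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

CONTAINER="workbench"                 # hypothetical container you work in
IMAGE="username/workbench:latest"     # hypothetical Docker Hub repository

# Snapshot the container's current state as an image (with a commit
# message), then push that image up to the hub.
save_image() {
  run docker commit -m "$1" "$CONTAINER" "$IMAGE"
  run docker push "$IMAGE"
}

# On another machine: pull the image down and start a shell in it.
fetch_image() {
  run docker pull "$IMAGE"
  run docker run -ti --name "$CONTAINER" "$IMAGE" bash
}

save_image "install pandoc templates"
```

Software updates happen inside the running container as on any linux server (apt-get update, apt-get upgrade, and so on), followed by another save_image call.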

A different way to go is to write your own
Dockerfile that installs the software and possibly adds the data you
always want available. This is a recipe to construct just what you want,
rather than the full history of everything you’ve ever had (i.e. it is a
leaner approach). To update, you would first pull the latest version
of the image you built on, and then rebuild your image from your
Dockerfile. Unless you’ve locked specific versions of software in your
install commands, this approach will just pull the latest versions as it
rebuilds, so ‘upgrading’ software (or updating datafiles) is never
necessary. The Docker Hub can do this automatically if you just point
it at a Dockerfile on GitHub. You can link the Docker Hub entry to
automatically rebuild when the base image you used is updated, which
makes sure you get those software updates too. You can also tickle a
rebuild (say, nightly) with a POST request. This workflow has a bit
more overhead but is a bit more transparent (whether just to you in the
case of a private image or to everyone for a public image), because you
have to maintain this recipe that rebuilds things from scratch all the
time. Naturally, the reproducible research benefits are thus a bit
higher on this side, I think.
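A rough sketch of the Dockerfile-based workflow, done by hand rather than through the Docker Hub automation. The image name (username/r-env) and the packages installed are placeholders I made up for illustration; rocker/rstudio is the rocker project's RStudio image. Commands are only printed unless DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Sketch: write a small recipe, then rebuild from it against the newest base.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

BUILD_DIR="$(mktemp -d)"
IMAGE="username/r-env:latest"   # hypothetical image name

# The recipe: a base image maintained by someone else plus your additions.
# The specific packages below are placeholders.
cat > "$BUILD_DIR/Dockerfile" <<'EOF'
FROM rocker/rstudio
RUN apt-get update \
 && apt-get install -y --no-install-recommends libxml2-dev \
 && rm -rf /var/lib/apt/lists/*
RUN Rscript -e 'install.packages("knitr", repos = "https://cloud.r-project.org")'
EOF

# --pull fetches the latest version of the base image before building,
# so a rebuild picks up upstream software updates as well.
run docker build --pull -t "$IMAGE" "$BUILD_DIR"
run docker push "$IMAGE"
```

Because nothing in this recipe pins versions, every rebuild installs whatever is current, which is the "upgrading is never necessary" behaviour described above.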

I actually use a mix
of both of these approaches, depending on the context. I tend to use
docker commit with ‘data-only’ containers, and the Dockerfile
approach with “software-only” containers.

For a project, I might have a Dockerfile that fires up the minimal setup for that
analysis. If my project is, say, organized as an R package on Github,
how and where should I store its associated Dockerfile?

In the same git repo, though you can put it anywhere you like (e.g.
inst/docker, or at top level and added to .Rbuildignore, depending
on preferences).

What should be in the Dockerfile? (For instance, should packages be there, or in something like Packrat?)

Good question. Note that you can mix and match – your Dockerfile can use
packrat commands to ensure it gets specific versions of packages (or any
of the packrat alternatives as well). I would think of your Dockerfile
like you would a .travis.yml file: it should contain the recipe to
install the complete software environment needed to run your analysis.
In fact, check out Rich FitzJohn’s totally awesome https://github.com/traitecoevo/dockertest
which automates the creation of said Dockerfile for an R package (with
the goal of, for instance, making it easier to debug travis errors
without having to push to travis).

Should I strive to use generic docker images available elsewhere?

Yup, re-using generic images (provided you trust them, e.g. preferably they are
automated builds, ideally are signed as official images, and at the very
least are regularly updated) is a good idea; it makes things more efficient
when transferring images around and building them, etc.

I think
of my general work interface into a working Machine as a terminal via
SSH, plus a browser with RStudio or Jupyter notebook. If the machine is a
Docker image, local or remote, would this be any different?

Short answer: That basic approach works just fine. Longer answer:
You cannot actually ssh into a docker image unless it is running an ssh
server; which most don’t install by default, but is easy enough to add.
A common pattern is to ssh into the remote machine only to launch the
docker image that provides the web service like RStudio or Jupyter, and
from then on rely on that web interface for most everything else (e.g.
coding, installing additional software, git push/pull etc). Such a user
might occasionally ssh onto the remote machine to manage the image
(e.g. docker commit, docker push, to move a copy to the cloud), but note
that these commands are not run inside the container, just on the
host. Such a user might occasionally need to enter the container, e.g.
to do some maintenance task that requires root privileges (typically not
granted at the web-interface level). While this can be done using
ssh, it’s generally easier and preferred to just run something like
docker exec -ti container-name bash from the host (e.g. after ssh-ing
if your host is remote) to get ‘inside’ and poke around.

Working locally you clearly never need ssh, even with docker containers. As I
hinted at above, my containers typically contain only computational
environments, not specific data I’m working on, and I tend to link local
volumes. This workflow looks identical to a non-docker based workflow
– it’s relatively invisible that your software is running in a
container, while I can use the same local programs for text editing,
git, file browsers, etc that I always do. …
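To make the last two points concrete, here is a small sketch of starting a web-service container with a local volume linked in, and of getting a shell inside it without ssh. The container name and the mount path inside the container are assumptions for illustration; recent rocker/rstudio images may also want a PASSWORD environment variable (passed with -e), which is left out here. Commands are only printed unless DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Sketch: RStudio in a container, working on files that stay on the host.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

NAME="rstudio"   # hypothetical container name

# Start RStudio Server in the background on port 8787, with the current
# directory linked in as a volume (the in-container path is an assumption).
start_rstudio() {
  run docker run -d --name "$NAME" -p 8787:8787 \
    -v "$PWD":/home/rstudio/project rocker/rstudio
}

# Run from the host (after ssh-ing there if it is remote) to get a shell
# inside the running container, e.g. for tasks that need root.
enter_container() {
  run docker exec -ti "$NAME" bash
}

start_rstudio
```

Files under the linked directory stay on the host, so the usual local editor, git client, and file browser keep working on them.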

There’s another workflow in which the whole Docker image is fired up, runs some
simulation or analysis, and returns the results and shuts down. When
do you use this, how does it work?

Running in “batch mode” with docker isn’t that different than it is in other contexts. However, docker makes this possible/easy in the context of cloud computing, where you first need to start a cloud server, then install all the necessary software, and finally shut the instance down when you’re done to avoid being charged for idle clock cycles.
This is also very similar to using Travis or other CI, particularly when using the “deploy” options to push results (e.g. building a Jekyll site) up to some source.

A key tool worth highlighting here is Docker machine. Docker machine is a super simple command-line program that
automates launching a cloud machine running docker (or even a local virtual machine like virtualbox). This makes it really easy to create and destroy cloud instances automatically. Typically, I use a bash script that runs these steps:

  • Spawn the cloud instance with desired number of cpus/ram using a docker-machine call
  • pull & run the docker container with the desired command
  • close the instance down

This assumes that the data/scripts to be run are either on the docker image
already, or are pulled onto the running container by the ‘desired
command’ (which may be another shell script on the container), and that the
results are pushed off the server (e.g. back up to GitHub) by the
‘desired command’.
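The three steps above might look roughly like the following. This is a sketch, not the actual script: the machine name, image name, and run_analysis.sh are hypothetical, and it uses the virtualbox driver for concreteness, whereas a cloud driver would take its own credential and size flags. Commands are only printed unless DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Sketch: spawn a machine, run one container on it, tear the machine down.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

MACHINE="batch-worker"              # hypothetical machine name
IMAGE="username/analysis:latest"    # hypothetical image
COMMAND="./run_analysis.sh"         # hypothetical script inside the image

batch_run() {
  # 1. Spawn the instance with the desired number of cpus.
  run docker-machine create --driver virtualbox --virtualbox-cpu-count 4 "$MACHINE"

  # Point the local docker client at the new machine.
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ eval \"\$(docker-machine env $MACHINE)\""
  else
    eval "$(docker-machine env "$MACHINE")"
  fi

  # 2. Pull and run the container with the desired command.
  run docker pull "$IMAGE"
  run docker run --rm "$IMAGE" "$COMMAND"

  # 3. Close the instance down.
  run docker-machine rm -y "$MACHINE"
}

batch_run
```

One caveat with a script this simple: if the container run fails when executed for real, the machine is left running, so a real version would probably want a trap or similar cleanup.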

What research-related content doesn’t live inside this ecosystem? Like, there’s no need to keep your PDFs in all this.

All up to you. For me, most of it doesn’t: my docker containers are mostly the boiler-plate software, not the research data I create. The latter lives in GitHub (or elsewhere for large data), and is pulled down at runtime. Whether you put these things in the docker image is really another version of when/whether you put such output objects in, say, your Github repo, or aim to provide only a ‘make clean’ environment, with derived outputs stored somewhere else. (The space cost is obviously more negligible, but the conceptual question is the same). Perhaps the docker container is a good option for ‘somewhere else’, but probably something more general purpose is better (e.g. zenodo, gh-pages, or Amazon S3, depending on the object etc).

Most collaborators aren’t going to know anything about this stuff. Does this
type of approach introduce any new friction when working with people with a
less computational bent? And oh yeah, data. I’ll be doing a lot more work where I need to share data on others’ terms. Where do you keep data in all this?

No new friction. Because there is no need or expectation for collaborators to use
docker, any more than I expect a collaborator to sit down at my laptop
and navigate my personal file system and tools. At minimum, this
approach is no harder than not using docker to collaborate because it is
no different. (That does not imply that such collaboration is easy!)
Beyond that, I’d say it is loosely analogous to any other computational
tool. Does using GitHub (or, to go back a few decades, using email for
that matter) make it harder to work with people without a computational
bent? At minimum, it helps you and they can ignore it. If you’re lucky,
they might pick up the technology too and then it pays real dividends.

Re data: as I’ve hinted at above, I see this more as just working with
however others store data than providing a data storage mechanism
itself. Data should live wherever it should live – chances are
that’s not just “on a docker container alone” any more than it is “on my
hard drive alone.” But you have the full suite of tools available to
you from inside docker to interface with data: be it rOpenSci APIs,
Amazon S3, GitHub, etc.

Most notably, Docker works really well
linking to external databases like Redis, Postgres, MongoDB, and the
rest (usually run in separate portable docker containers). This
approach is extremely powerful and adapts/scales well in the cloud.

Other little things:

Should I use Vagrant on my OSX laptop so I can have Ubuntu as my common Docker OS rather than Tiny Core OS?

I dunno, but it seems most Docker devs (most any devs?) use macs these
days, so I’d use what they use. That appears to be docker-machine, which
creates a virtualbox instance running boot2docker (much more light
weight than a full Vagrant+virtualbox+docker workflow, and much less to
learn). Also check out https://kitematic.com/ for easy docker setup on Mac? Lemme know how it goes!

How do you use Docker with, or in lieu of, Travis-CI and related services and tasks?

While almost every CI platform, including travis, uses Docker itself, few
let you “bring your own docker containers”, at least not for free.
CircleCI does. This is really awesome, because I don’t have to write
and debug different .travis.yml files for every project; the CI just
runs my container.

For instance, I use CircleCI to
compile my knitr / pandoc / jekyll labnotebook on every push and via
nightly build triggers (POST request from a cron job). Even though some
posts take days to run even on multicore machines, CircleCI can build
the whole notebook in a few minutes because it pulls down the data
container I used when first writing the post, which contains the knitr
cache files. Meanwhile I can just let Circle run the R code for any
non-intensive posts; and it will update the cache file container on
docker hub as well as updating my site.


#4

First, many thanks for your comprehensive replies! Thoughts, some of which are questions:

I think I can make most of this setup public on Docker Hub. Private data is small or project-based, so I can just have a private git repo for API keys and dotfiles that downloads and runs when I run the docker image.

Oh, perfect, my concern with the lightweight Dockerfile-based approach was that it would take forever to build the image every time I needed it.

What are your “data-only” containers? Are these where you keep the results of analyses? So, for instance, I would have a data container of all the results of simulations for a particular project? Or for the data I’m analyzing for the project?

I think I understand this, but the shell is still a significant work environment for me and the shell interface of RStudio is limiting. But I can set up a Dockerfile that launches my RStudio server, and ssh/mosh as well.

Cool! My individual projects typically have an inst/notebook/ directory for Rmd files, and I’ve wanted to create associated gh-pages notebooks that would build with each commit. I’ve never had a good solution as to whether/how to keep the cache files for big runs. Can you point me to the code/setup for this?


#5

I use one for the knitr caches of my notebooks. Actually, I think this is mostly gratuitous: it would be more sensible to deploy and pull the cache from Amazon S3 than to link a volume container. (Notably, because I am not interested in preserving the history of old caches, the latter requires some hijinks.) Then again, S3 is very cheap, but Docker Hub is free.

A much more meaningful way to do this is probably when using data stored in a real database architecture. We do this for the fishbaseapi data, but I don’t yet use it in other contexts.

Definitely. If the shell is a desired environment/service and not just a barrier between you and a desired service, then adding an ssh server to your container makes perfect sense.

Like I said, the way I’m doing this is convoluted and would be cleaner with S3. You ain’t gonna like this:

It looks like just 2 lines of code: line 12 tells docker to build the notebook (via Yihui’s awesome servr package for jekyll/knitr integration) using the linked “volume container” that has the cache. Then the deploy just uses docker to run the deploy script (pushing stuff to the correct git branch).

But there are hijinks hiding in the deploy script that destroy and recreate the cache container to avoid storing stale cache files in old docker layers. I think this is an example of over-engineering.

A clean solution should rely on general purpose tools for general purposes (e.g. put caches in S3 like everyone else does), and just let the container be the computational environment.


#6

Maybe ugly but there’s a lot of useful example information in that script!

A few more questions re: high-performance computing:

  • Running a docker image on a linux box, can I do multicore computing without any additional setup? My guess is yes, because it uses the local kernel.
  • Running a docker image through boot2docker, can I do multicore computing on my OSX laptop? I suspect there are more issues relating to the VM here. (Maybe someone else can chime in here)
  • Any useful examples of parallelism on cloud machines with docker and R? Especially multi-machine/cluster?
    • Relatedly, what’s your preferred provider when you need many-core or cluster cloud instances?

#7

Correct, you get multicore for free running containers on linux (though if you want, you can constrain containers to a fixed number of cores, ram, or units of cpu). Container performance is virtually equivalent to a native installation.
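As a quick illustration of constraining a container, here is a sketch using the standard docker run resource flags. The particular limits (cores 0-3, 8g of RAM) are arbitrary examples, and rocker/r-base is just a convenient image to run nproc in. Commands are only printed unless DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Sketch: unconstrained vs constrained container runs.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

IMAGE="rocker/r-base"

# Default: the container can use every core the host kernel offers.
run_unconstrained() {
  run docker run --rm "$IMAGE" nproc
}

# Pin the container to four cores and cap its memory.
run_constrained() {
  run docker run --rm --cpuset-cpus="0-3" --memory="8g" "$IMAGE" nproc
}

run_unconstrained
run_constrained
```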

I believe so; this would be set in virtualbox. Not quite sure how boot2docker handles it, but there are simple flags for this in docker-machine, (just install the mac version of docker-machine, works basically like boot2docker but more general), e.g.

docker-machine create --virtualbox-cpu-count 4 machine-name

Google. At least right now, they offer the highest cpu/memory ratio in their high-cpu images, which gives the lowest cost-per-cpu, and they charge in increments of 1 minute (after the first 10 min) instead of 1 hr, (which is obviously great if you run lots of jobs that take only an hour or two to complete). Otherwise, digitalocean pricing is simple and usually as good or better than the alternatives.

Mostly I just use the 32 core machines; I may break a job over multiple machines by hand but haven’t tried to automate that from within R. Rich FitzJohn has done more of that stuff, but you might do better following up with him directly on that.

p.s. the very fast network connections offered by most cloud providers are a nice bonus when pulling/pushing large docker images for the first time!


#8

Oh, cool. @richfitz, can you elaborate or point us to anything?


#9

This is all very much a work in progress, and there are two parts to the approach.

The first is generating an abstracted interface over a set of virtual machines - a group of software engineers we are collaborating with is doing that. We’re using mesos and marathon with a frontend around some ansible scripts. I think that organising the security stuff to make a little private cluster is probably the hard part. It’s not public (out of my hands) and there’s nothing I can share on that end at the moment. OTOH, something like docker machine can do this very nicely I believe and will probably overtake our effort.

The second part is something that can use that bunch of computing. What I arrange to bring up is one redis container (the official one) and then build a container for a project that contains all the dependencies plus some queuing software (see below). Then I launch n of the project containers so that they can see the redis container and also a shared volume (that could be done with a docker volume I presume but that’s not the approach we’ve taken, but perhaps we should!). Then I can connect to the redis server on AWS from the comfort of my own computer (via an SSH tunnel) and queue jobs, check on the status of jobs, etc
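On a single docker host, the container layout described above could be sketched like this. It is not the actual setup (which sits on mesos/marathon): the project image name, the shared directory, and the worker count are all hypothetical, and the project image is assumed to start a queue worker as its default command. It uses the older --link flag to let workers see the redis container; a user-defined docker network would be the more current way to do the same thing. Commands are only printed unless DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Sketch: one redis container plus n identical worker containers.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

N_WORKERS=4                                    # how many project containers
PROJECT_IMAGE="username/project-worker:latest" # hypothetical project image
SHARED_DIR="/srv/shared"                       # hypothetical shared host dir

launch_cluster() {
  # The official redis image holds the job queue.
  run docker run -d --name redis redis

  # Each worker can reach redis by name and sees the same shared volume.
  i=1
  while [ "$i" -le "$N_WORKERS" ]; do
    run docker run -d --name "worker-$i" --link redis:redis \
      -v "$SHARED_DIR":/shared "$PROJECT_IMAGE"
    i=$((i + 1))
  done
}

launch_cluster
```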

The pieces involved:

I use dockertest for this which lets you write a yaml file like this. The version for projects (rather than packages) is about the same. This is optional and you can build containers by hand if you’d rather!

The queuing approach is rrqueue. The idea is that any number of worker R processes poll a Redis database for new jobs. You can scale the worker pool up and down after submission (think mclapply but you can add more cores)[*]. Unlike things like mclapply, the controller process is not blocking, so you can Ctrl-C after running the equivalent of mclapply, reattach later, and get information on how far through the jobs are. It’s also not just for map/lapply-like tasks, so if you have jobs that depend on other jobs you could submit them once all the dependency jobs have completed. Basically none of that is documented though. But we’ve been using it as an alternative to mclapply locally and on AWS for a couple of months now.

* Doing this in practice would require being able to scale the cluster itself which the software people have not implemented.