Let’s say I have a Dockerfile that describes my general computing setup,
meaning that it has a lot more stuff than I need for one project: It has
R/RStudio, Anaconda/Jupyter, node.js, a bunch of assorted command line
and connectivity tools, custom pandoc templates, dotfile configurations,
and files with API keys, etc. Each custom part might be in GitHub
repos, but in general I want this setup ready to go, and quickly portable in case I decide to move to a more powerful machine, need to work offline, my hardware fails, or I lock myself out of my office.
What should be my workflow for keeping it updated? How do I store the
image so I can quickly fire it up on either my laptop, my powerful
headless server, or a cloud asset?
Hosting the image on Docker Hub is probably the best way to go. You get one
private image slot free, though it’s pretty easy to set up your own
‘docker registry’ (which is open source, like docker itself) on your own
server. This is pretty analogous to putting git on a server – the
open source registry software has no web front end, but it makes it easy
to host as many private images as you like (or can fit on a server).
Transferring via the Hub or private docker registry is almost always
faster and easier than moving the image around as a tar ball or anything
else.
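For concreteness, here’s a rough sketch of both options; the image name, account name, and server hostname are hypothetical.

```bash
# Push a local image (hypothetically tagged "workbench") to a Docker Hub
# account named "username".
docker tag workbench username/workbench
docker push username/workbench      # upload to the Hub
docker pull username/workbench      # fetch it on another machine

# Or run the open-source registry on your own server (add TLS, or the
# --insecure-registry daemon flag, before pushing to it over the network):
docker run -d -p 5000:5000 --name registry registry:2
docker tag workbench myserver.example.org:5000/workbench
docker push myserver.example.org:5000/workbench
```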
My suggestion is to build the image based on
existing images maintained by someone else. I find it easier to have a
few different images for different major software (e.g. an R/RStudio
image based on the rocker project and a separate Python/Jupyter image
based on their official images), but that’s a matter of taste, and might
be more of a nuisance if you want the same private data to be available on
each image. Most of my images are primarily software, and the data I
pull from git/github, or by linking local volumes that have data. Your
kitchen-sink approach has the elegance of simplicity over mine, since pulling from git (using credentials passed as env vars) is an extra step, and linking local volumes is obviously less portable. I’m still deciding what I like best myself. One advantage of my approach is cleaner isolation: the image provides just the software environment, which you might want to share with others, while any private data stays securely off the image. I’ve also had success linking ‘volume containers’ to provide the private data part while still being portable.
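A hedged sketch of that volume-container pattern, with hypothetical container names and file paths:

```bash
# Data-only container holding the private files.
docker create -v /data --name mydata busybox
# Populate the volume by mounting a local folder and copying files across.
docker run --rm --volumes-from mydata -v ~/secrets:/src busybox \
  cp /src/api-keys.env /data/
# Any software container can then mount the same volume at run time,
# so the private data never gets baked into a shareable image.
docker run -d -p 8787:8787 --volumes-from mydata rocker/rstudio
```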
The best way to keep an image updated, I find, is to always use it, whether working locally or
not. How to “keep it updated” really covers quite a few different topics.
The first bifurcation point here is regarding `docker commit`.
Using `docker commit` is probably the most intuitive way to go. You start with some prebuilt image, install your software, add/generate your data, and then use `docker commit` any time you want to save/move your data/software, as you would with `git commit`. You can give a commit message, roll back commits, etc. You run `docker push` and `docker pull` to move the image up to the Hub and pull it down on other computers. You update installed software or add new software just as you would on any linux server (e.g. for debian-based containers: apt-get update, apt-get upgrade, apt-get install, etc). This is probably the way to go, particularly with the kitchen-sink container that has your own data on it. In this approach, you never actually have to write a Dockerfile either.
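A rough sketch of that cycle (container and image names are hypothetical):

```bash
# Start from a prebuilt image and work interactively inside it.
docker run -it --name workbench ubuntu bash
# ...apt-get install things, add data, configure dotfiles, then exit...

# Snapshot the container as a new image, with a commit message.
docker commit -m "add pandoc templates and dotfiles" workbench username/workbench
docker push username/workbench   # move it up to the Hub
docker pull username/workbench   # pull it down on another machine
```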
A different way to go is to write your own
Dockerfile that installs the software and possibly adds the data you
always want available. This is a recipe to construct just what you want,
rather than the full history of everything you’ve ever had (i.e. a leaner approach). To update, you would first pull the latest version of the image you built on top of, and then rebuild your image from your Dockerfile. Unless you’ve locked specific versions of software in your install commands, this approach will just pull the latest versions as it rebuilds, so ‘upgrading’ software (or updating data files) is never
necessary. The Docker Hub can do this automatically if you just point
it at a Dockerfile on GitHub. You can link the Docker Hub entry to
automatically rebuild when the base image you used is updated, which
makes sure you get those software updates too. You can also trigger a rebuild (say, nightly) with a POST request. This workflow has a bit
more overhead but is a bit more transparent (whether just to you in the
case of a private image or to everyone for a public image), because you
have to maintain this recipe that rebuilds things from scratch all the
time. Naturally, the reproducible research benefits are thus a bit
higher on this side, I think.
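As a sketch of that recipe-based cycle (the image name and installed packages are hypothetical; the Dockerfile is written here as a bash heredoc so the whole update step is one script, and install2.r is a helper that ships with the rocker images):

```bash
# Maintain a small recipe rather than a full history of changes.
cat > Dockerfile <<'EOF'
FROM rocker/rstudio
RUN apt-get update && apt-get install -y --no-install-recommends pandoc \
    && rm -rf /var/lib/apt/lists/*
RUN install2.r --error dplyr ggplot2
EOF

# --pull grabs the latest base image first, so rebuilding also picks up
# upstream updates; Docker Hub automated builds do the same thing
# server-side when pointed at this Dockerfile on GitHub.
docker build --pull -t username/analysis .
docker push username/analysis
```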
I actually use a mix
of both of these approaches, depending on the context. I tend to use `docker commit` with ‘data-only’ containers, and the Dockerfile approach with ‘software-only’ containers.
For a project,
I might have a Dockerfile that fires up the minimal setup for that
analysis. If my project is, say, organized as an R package on Github,
how and where should I store its associated Dockerfile? In the same git repo, though you can put it anywhere you like (e.g. in `inst/docker`, or at the top level and added to `.Rbuildignore`, depending on preferences). What should be in the Dockerfile? (For instance, should packages be there, or in something like Packrat?)
Good question. Note that you can mix and match – your Dockerfile can use
packrat commands to ensure it gets specific versions of packages (or any
of the packrat alternatives as well). I would think of your Dockerfile
like you would a .travis.yml file: it should contain the recipe to
install the complete software environment needed to run your analysis.
In fact, check out Rich FitzJohn’s totally awesome dockertest (https://github.com/traitecoevo/dockertest), which automates the creation of said Dockerfile for an R package (with
the goal of, for instance, making it easier to debug travis errors
without having to push to travis).
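To make the mix-and-match idea concrete, here is a hedged sketch (not dockertest’s actual output) of a Dockerfile that restores packrat-pinned package versions for a hypothetical package; the exact packrat invocation may vary with your project layout.

```bash
cat > Dockerfile <<'EOF'
FROM rocker/rstudio
# Copy the package source, including packrat/packrat.lock
COPY . /home/rstudio/mypackage
WORKDIR /home/rstudio/mypackage
# Restore the exact package versions recorded in the packrat lockfile
RUN R -e "install.packages('packrat', repos='https://cran.rstudio.com'); packrat::restore()"
EOF
docker build -t username/mypackage .
```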
Should I strive to use generic docker images available elsewhere?
Yup, re-using generic images (provided you trust them: e.g. they are preferably automated builds, ideally signed as official images, and at the very least regularly updated) is a good idea; it makes transferring images around and building them more efficient, etc.
I think
of my general work interface into a working machine as a terminal via
SSH, plus a browser with RStudio or Jupyter notebook. If the machine is a
Docker image, local or remote, would this be any different?
Short answer: That basic approach works just fine. Longer answer:
You cannot actually ssh into a docker container unless it is running an ssh server, which most don’t install by default, but which is easy enough to add.
A common pattern is to ssh into the remote machine only to launch the
docker image that provides the web service like RStudio or Jupyter, and
from then on rely on that web interface for most everything else (e.g.
coding, installing additional software, git push/pull etc). Such a user
might occasionally ssh onto the remote machine to manage the image
(e.g. docker commit, docker push, to move a copy to the cloud), but note
that these commands are not run inside the container, just on the
host. Such a user might occasionally need to enter the container, e.g.
to do some maintenance task that requires root privileges (typically not
granted at the web-interface level). While this can be done using ssh, it’s generally easier and preferred to just run something like `docker exec -ti container-name bash` from the host (e.g. after ssh-ing in, if your host is remote) to get ‘inside’ and poke around.
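In command form, that pattern might look like this (the host, container, and image names are hypothetical):

```bash
# On the remote host (after ssh-ing in), launch the web service once:
docker run -d -p 8787:8787 --name rstudio rocker/rstudio
# ...then do day-to-day work in the browser at http://<host>:8787 ...

# Occasional maintenance from the host: snapshot and push a copy,
# or drop 'inside' the running container as root.
docker commit rstudio username/rstudio-snapshot
docker push username/rstudio-snapshot
docker exec -ti rstudio bash
```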
Working locally you clearly never need ssh, even with docker containers. As I
hinted at above, my containers typically contain only computational
environments, not specific data I’m working on, and I tend to link local
volumes. This workflow looks identical to a non-docker-based workflow – it’s relatively invisible that the software is running in a container, and I can use the same local programs for text editing, git, file browsers, etc. that I always do. …
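For example, a hypothetical local session might just mount the project directory:

```bash
# Mount a local project folder into the container; editing, git, and file
# browsing happen with the usual local tools, while R/RStudio run inside.
docker run -d -p 8787:8787 \
  -v ~/projects/myanalysis:/home/rstudio/myanalysis \
  rocker/rstudio
```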
There’s another workflow in which the whole Docker image is fired up, runs some simulation or analysis, returns the results, and shuts down. When do you use this, and how does it work?
Running in “batch mode” with docker isn’t that different than it is in other contexts. However, docker makes this possible/easy in the context of cloud computing, where you first need to start a cloud server, then install all the necessary software, and finally shut the instance down when you’re done to avoid being charged for idle clock cycles.
This is also very similar to using Travis or other CI, particularly when using the “deploy” options to push results (e.g. building a Jekyll site) up to some source.
A key tool worth highlighting here is Docker Machine, a super simple command-line program that automates launching a cloud machine running docker (or even a local virtual machine like virtualbox). This makes it really easy to create and destroy cloud instances automatically. Typically, I use a bash script that runs these steps:
- Spawn the cloud instance with the desired number of CPUs/RAM using a docker-machine call
- Pull and run the docker container with the desired command
- Close the instance down
This assumes that the data/scripts to be run are either on the docker image already, or are pulled onto the running container by the ‘desired command’ (which may be another shell script on the container), and that the results are pushed off the server (e.g. back up to GitHub) by that same command, as in the sketch below.
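A sketch of such a script, with a hypothetical cloud driver, machine size, image name, and run script:

```bash
#!/bin/bash
set -e
# 1. Spawn a cloud instance with the desired resources
#    (assumes DIGITALOCEAN_ACCESS_TOKEN is set in the environment).
docker-machine create --driver digitalocean --digitalocean-size 16gb runner
# Point the local docker client at the new machine.
eval "$(docker-machine env runner)"
# 2. Pull & run the container with the 'desired command', which is expected
#    to fetch its inputs and push its results (e.g. back up to GitHub) itself.
docker run --rm username/analysis ./run_simulation.sh
# 3. Close the instance down so it stops accruing charges
#    (-y skips the confirmation prompt in recent docker-machine versions).
docker-machine rm -y runner
```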
What research-related content doesn’t live inside this ecosystem? Like, there’s no need to keep your PDFs in all this.
All up to you. For me, most of it doesn’t: my docker containers are mostly the boilerplate software, not the research data I create. The latter lives in GitHub (or elsewhere for large data), and is pulled down at runtime. Whether you put these things in the docker image is really another version of when/whether you put such output objects in, say, your GitHub repo, or aim to provide only a ‘make clean’ environment, with derived outputs stored somewhere else. (The space cost is obviously less of a constraint, but the conceptual question is the same.) Perhaps the docker container is a good option for ‘somewhere else’, but probably something more general purpose is better (e.g. Zenodo, gh-pages, or Amazon S3, depending on the object, etc).
Most collaborators aren’t going to know anything about this stuff. Does this
type of approach introduce any new friction when working with people with a
less computational bent? And oh yeah, data. I’ll be doing a lot more work where I need to share data on others’ terms. Where do you keep data in all this?
No new friction, because there is no need or expectation for collaborators to use docker, any more than I expect a collaborator to sit down at my laptop
and navigate my personal file system and tools. At minimum, this
approach is no harder than not using docker to collaborate because it is
no different. (That does not imply that such collaboration is easy!)
Beyond that, I’d say it is loosely analogous to any other computational
tool. Does using GitHub (or, to go back a few decades, using email for
that matter) make it harder to work with people without a computational
bent? At minimum, it helps you, and they can simply ignore it. If you’re lucky,
they might pick up the technology too and then it pays real dividends.
Re data: as I’ve hinted at above, I see this more as just working with
however others store data than providing a data storage mechanism
itself. Data should live wherever it should live – chances are that’s not just “in a docker container alone” any more than it is “on my hard drive alone.” But you have the full suite of tools available to
you from inside docker to interface with data: be it rOpenSci APIs,
Amazon S3, GitHub, etc.
Most notably, Docker works really well for linking to external databases like Redis, Postgres, MongoDB, and the rest (usually run in separate, portable docker containers). This approach is extremely powerful and adapts/scales well in the cloud.
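For example, a hedged sketch of linking an R session to a Postgres container (the container name and password are hypothetical):

```bash
# Run the database in its own container...
docker run -d --name db -e POSTGRES_PASSWORD=secret postgres
# ...and link it into the analysis container: inside, the database is
# reachable at hostname 'postgres' on the usual port 5432.
docker run -it --link db:postgres rocker/rstudio bash
```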
Other little things:
Should I use Vagrant on my OSX laptop so I can have Ubuntu as my common Docker OS rather than Tiny Core OS? I dunno, but it seems most Docker devs (most any devs?) use macs these days, so I’d use what they use. That appears to be docker-machine, which creates a virtualbox instance running boot2docker (much more lightweight than a full Vagrant+virtualbox+docker workflow, and much less to learn). Also check out https://kitematic.com/ for easy docker setup on Mac. Lemme know how it goes!
How do you use Docker with, or in lieu of, Travis-CI and related services and tasks?
While almost every CI platform, including Travis, uses Docker itself, few let you “bring your own docker containers”, at least not for free. CircleCI does. This is really awesome, because I don’t have to write and debug different .travis.yml files for every project; the CI just runs my container.
For instance, I use CircleCI to
compile my knitr / pandoc / jekyll labnotebook on every push and via nightly build triggers (a POST request from a cron job). Even though some
posts take days to run even on multicore machines, CircleCI can build
the whole notebook in a few minutes because it pulls down the data
container I used when first writing the post, which contains the knitr
cache files. Meanwhile, I can just let Circle run the R code for any non-intensive posts, and it will update the cache-file container on Docker Hub as well as updating my site.
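The underlying steps such a CI build runs are roughly these, in a simplified sketch that folds the software environment and the knitr cache into one hypothetical image (the real setup keeps them in separate containers, and the actual commands live in the CI’s config file):

```bash
# Pull the image carrying the environment plus the cache from the last build,
# run the site build inside it, then commit and push the refreshed cache so
# the next build starts warm.
docker pull username/labnotebook
docker run --name build username/labnotebook ./build_site.sh
docker commit build username/labnotebook
docker push username/labnotebook
```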