R interface for AWS services

Is there really no mature client to interact with AWS services from R (or more simply, just S3 buckets)?

I see 7 or so packages listed in Scott & co’s task view (http://cran.r-project.org/web/views/WebTechnologies.html), but most of those with S3 access are just wrappers around the aws cli tool, which must therefore be installed separately, and it appears none are on CRAN. (The one that was, awsTools, is now deprecated even in its dev version in favor of the author’s new https://github.com/armstrtw/Rawscli, which looks like just another thin wrapper around aws cli.)

My ideal would be an actively maintained CRAN package implemented on top of httr; this gist from Hadley looks pretty close for the S3 side of things: https://gist.github.com/hadley/5532482. (Perhaps the XML dependency could now be swapped out for xml2.) Currently this implements only the GET method, and it looks like the response object handling needs to be tweaked a bit to work. It also looks like POST methods (and maybe PUT as well?) require the rather more complex creation of signing keys described in the Amazon documentation: http://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html
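
For anyone curious what the signing-key step involves, here is a minimal sketch of the Signature Version 4 key derivation described in the Amazon documentation linked above. It is not taken from Hadley’s gist. It assumes the openssl package for HMAC-SHA256, and the function names (`hmac_sha256`, `derive_signing_key`, `to_hex`) are made up for illustration. It covers only the key derivation, not the canonical request or the string to sign.

```r
library(openssl)

# HMAC-SHA256 over a string, with a raw key, returning plain raw bytes
# so that the result can be used as the key of the next step.
hmac_sha256 <- function(key, msg) {
  unclass(sha256(charToRaw(msg), key = key))
}

# Derive the SigV4 signing key by chaining four HMACs:
# date -> region -> service -> the literal "aws4_request".
# `date` is YYYYMMDD; all argument values are supplied by the caller.
derive_signing_key <- function(secret_key, date, region, service = "s3") {
  k_date    <- hmac_sha256(charToRaw(paste0("AWS4", secret_key)), date)
  k_region  <- hmac_sha256(k_date, region)
  k_service <- hmac_sha256(k_region, service)
  hmac_sha256(k_service, "aws4_request")
}

# Hex-encode raw bytes (the final signature is sent as lowercase hex).
to_hex <- function(bytes) {
  paste(as.character(bytes), collapse = "")
}
```

The final signature would then be something like `to_hex(hmac_sha256(key, string_to_sign))`, where building `string_to_sign` and attaching the result to an httr request is the part this sketch leaves out.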

Anyway, is anyone familiar with better options or would this be a reasonable project that would catch anyone’s interest?

In case anyone is interested in this, Thomas Leeper pointed me to his cloudyr project, which already has a great start on a nice RESTful, stand-alone & platform agnostic interface to the AWS APIs: http://cloudyr.github.io/ Definitely check it out and get involved if this sounds interesting!

hey @cboettig - Good question

Does it look like Leeper’s cloudyr org has the pkg(s) you’re looking for? Or at least a good start on what you’re looking for?

Not yet, but we’re working on it. (Or rather, I tried and failed and am
looking for others to help me get unstuck: see


)

Cool, sounds good…

Chiming in a few months later. There is still no mature solution for initializing AWS instances and passing them R scripts. After a lot of googling, it seems that people either go for StarCluster (a headache to install, and serious overkill for embarrassingly parallel tasks) or manually ssh into each instance and paste in scripts. It looks like Rawscli (https://github.com/armstrtw/Rawscli) is supposed to go in that direction, but there has been no development since @cboettig posted a few months back. I think a lot of people are looking for something that goes like this:

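```{r}
library(snow)
library(foreach)
library(doSNOW)

# initialize amazon instances; init() is the hypothetical function being asked for
hostnames <- init(ami = , ssh_key = , instance_count = )
cl <- makeCluster(hostnames, type = "SOCK")
# send files through scp?
registerDoSNOW(cl)  # register the cluster as the foreach backend
results <- foreach(task = tasks) %dopar% {  # tasks: the list of tasks
    install.packages()  # whatever packages the task needs
    # perform task
}
stopCluster(cl)
```
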
Does anyone know if such a framework exists yet?

Just because this popped up in search results, it’s worth saying that interop with AWS is still a long-standing problem for the R community. Cloudyr has continued to make great strides… but AWS switches up their web APIs a bit too much and the surface area is huge (and expanding). Having programmed in R against AWS a lot over the past five years, my take is that your best bet is to program against an AWS-supported SDK. There is none for R, but language interop is a place where R shines. Using boto3 through {reticulate} (Python) is my current favored approach. In the past I’ve also used {rJava} to connect to the AWS Java SDK (my package for doing that was {awsjavasdk}).
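
To give a rough idea of what the {reticulate} + boto3 route looks like, here is a minimal sketch. It assumes that Python’s boto3 is installed where reticulate can find it and that AWS credentials are already configured the usual boto3 way (environment variables or `~/.aws`). The wrapper names `list_s3_buckets` and `download_s3_object` are made up for illustration, and the bucket, key and file path are whatever the caller passes in.

```r
library(reticulate)

# Return the names of all S3 buckets visible to the current credentials.
list_s3_buckets <- function() {
  boto3 <- import("boto3")
  s3 <- boto3$client("s3")
  resp <- s3$list_buckets()  # Python dict, converted to an R list
  vapply(resp$Buckets, function(b) b$Name, character(1))
}

# Download a single object to a local file and return the local path.
download_s3_object <- function(bucket, key, dest) {
  s3 <- import("boto3")$client("s3")
  s3$download_file(Bucket = bucket, Key = key, Filename = dest)
  invisible(dest)
}
```

Because the calls go straight through to the AWS-supported SDK, anything boto3 documents is reachable the same way, which is the main appeal over keeping an R-native client in sync with the API.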

As for starting up a temporary cluster? It’s a headache - but doable. But, I’ve long since stopped maintaining any tooling for doing that. If I’ve a problem that requires more than the 24 TB of RAM offered by a u-24tb1.metal and its 448 processors… then I don’t really want to be fussing with a SOCK cluster.