crul pagination helper

I help a lot of users of rOpenSci packages that work with web APIs. Many of these APIs are paginated - that is, if an API allows 10 results per page, you have to make 10 requests to get all 100 results.

Some users are surprised to find out that they don't get all the results. Many have questions about how to work with pagination and would like it handled for them.
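
For example, getting 100 results at 10 per page means looping over offsets yourself. A minimal sketch with crul, assuming an API that (like Crossref) takes rows and offset query parameters:

library(crul)

# manual pagination: 100 results, 10 per page = 10 requests
cli <- HttpClient$new(url = "http://api.crossref.org")
pages <- list()
for (i in seq_len(10)) {
  res <- cli$get("works", query = list(rows = 10, offset = (i - 1) * 10))
  pages[[i]] <- res$parse("UTF-8")
}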

crul pagination

In the HTTP client crul that I maintain, I just added a new class: Paginator

You pass an object of class HttpClient with connection details, and specify a few details about how pagination works with the specific API you’re working with - then Paginator takes care of the rest.

This works only with synchronous requests for now, but in theory it can be made to work for asynchronous requests too.

example

Update to the latest dev version on GitHub to get the pagination feature.
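
For example, using devtools:

devtools::install_github("ropensci/crul")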

Set up connection details:

(cli <- HttpClient$new(url = "http://api.crossref.org"))
#> <crul connection>
#>   url: http://api.crossref.org
#>   curl options:
#>   proxies:
#>   auth:
#>   headers:

Pass to Paginator, and set the required details:

  • by: method to paginate by (only query_params for now, also the default; see the docs for future options)
  • limit_param: name of the limit parameter
  • offset_param: name of the offset parameter
  • limit: total number of results to get
  • limit_chunk: number of results to get per page (chunk)

Creating the Paginator doesn't perform any HTTP requests yet.

(cc <- Paginator$new(client = cli, by = "query_params", limit_param = "rows",
    offset_param = "offset", limit = 50, limit_chunk = 10))
#> <crul paginator>
#>   base url: http://api.crossref.org
#>   by: query_params
#>   limit_chunk: 10
#>   limit_param: rows
#>   offset_param: offset
#>   limit: 50
#>   status: not run yet

Now call the HTTP method you want - here GET, via get():

cc$get('works')
#> OK

The object's status is now updated with the number of requests done:

cc
#> <crul paginator>
#>   base url: http://api.crossref.org
#>   by: query_params
#>   limit_chunk: 10
#>   limit_param: rows
#>   offset_param: offset
#>   limit: 50
#>   status: 5 requests done

Other methods to use:

# get all responses
cc$responses()
# get all HTTP status objects
cc$status()
# get all HTTP status codes
cc$status_code()
# get all times
cc$times()
# get all raw bytes
cc$content()
# parse all responses
cc$parse()
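
Since the Crossref API returns JSON, you could then parse each page with jsonlite - a sketch, assuming jsonlite is installed and the Crossref layout where records live under message$items:

library(jsonlite)

# each element of cc$parse() is one page of JSON text
pages <- lapply(cc$parse(), fromJSON)
# pull out the records from each page
items <- lapply(pages, function(x) x$message$items)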

thoughts?

Would love any feedback on this - there are bound to be lots of edge cases because there's no one way to make a web API.

This looks great and I would definitely find this suitable for some situations. One question though:

In:

cc$get('works')

It looks like that call blocks until 50 results come in because we’ve specified 50 as a limit. Would it also be possible to create a Paginator without knowing how many results to get? I’m thinking specifically about either dynamically getting all results from a Solr query or enabling the Paginator to (1) know how many results it got back in a given request and (2) know to stop on the first response with zero results.

Short answer: yes, it's possible.

Longer answer: there are two other methods - link_headers and cursors - that haven't been implemented yet, and with those this will be easier. Link headers are links to the next pages given in the response headers (the GitHub API is an example), and cursors are used in Solr APIs. With both of these methods the user could provide a max number of records they want, and we'd just keep doing requests, using information from each response (the full URL in the case of link headers, or the cursor parameter in the case of cursors) to make the subsequent requests.
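
Until those are implemented, a manual "stop on the first empty page" loop is possible with a plain HttpClient - a rough sketch, again assuming the Crossref-style JSON layout with records under message$items:

library(crul)
library(jsonlite)

cli <- HttpClient$new(url = "http://api.crossref.org")
offset <- 0
page_size <- 100
results <- list()
repeat {
  res <- cli$get("works", query = list(rows = page_size, offset = offset))
  items <- fromJSON(res$parse("UTF-8"))$message$items
  if (NROW(items) == 0) break  # stop on the first empty page
  results[[length(results) + 1]] <- items
  offset <- offset + page_size
}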

Caveat: Depending on the API, you may run into rate limits in the “get all the results” use cases, which brings up another helper that should be made to throttle on the client side.
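
A crude client-side throttle is just a pause between requests - a sketch, not a built-in crul feature:

# space requests out, e.g. at most ~1 per second (adjust to the API's documented limit)
pages <- list()
for (i in seq_len(5)) {
  pages[[i]] <- cli$get("works", query = list(rows = 10, offset = (i - 1) * 10))
  Sys.sleep(1)  # crude throttle
}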

This will be tough with some APIs - if they don't give back the total number of results found for the query, there's no way to do this automatically.

This is no problem at all - Paginator only performs the requests and isn't checking for errors, so errors on the client or server side won't stop the entire thing.

Thanks for the info. Sounds like it’ll be useful for sure. I have a few packages that should be migrated to crul but haven’t done that yet.

Cool, glad to hear it will be useful.

This would have made PR 94 to the traits package, which adds pagination to traits::bety_GET(), a lot more straightforward. I don't want to delay that PR (or fix what isn't broken) but … if you are looking for a good use case and want to refactor, I suspect this new feature could replace about 50 lines of code with one or two …

Thanks David. Sounds like a good use case. It does sound good to leave it as is if it already works - maybe some day it could be replaced with this once it's more battle-tested.

Hi,

Thanks again for Paginator - it's really helpful. I was wondering if there's a simple way to handle the case where the limit is not a multiple of the limit chunk. For example, if I want 27 results in chunks of 10, I have to do 3 requests (the last one stopping at 7), but right now it looks like only 2 requests are done in Paginator (i.e. you get 20 results).

A reprex:

library(crul)

## params
rows <- 27
page_size <- 10

cli <- HttpClient$new(url = "https://data.humdata.org") ## ckan-based
cc <- Paginator$new(client = cli,
                   by = "query_params",
                   limit_param = "rows",
                   offset_param = "start",
                   limit = rows,
                   limit_chunk = page_size)

cc$get(path = "/api/3/action/package_search", query = list(q = "*:*"))
cc
## <crul paginator> 
##   base url: https://data.humdata.org
##   by: query_params
##   limit_chunk: 10
##   limit_param: rows
##   offset_param: start
##   limit: 27
##   status: 2 requests done
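
For reference, with limit = 27 and limit_chunk = 10 the expected behaviour would be ceiling(27 / 10) = 3 requests, the last one asking for only 7 rows:

ceiling(27 / 10)  # 3 requests expected
27 - 2 * 10       # 7 rows left for the final request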

Thanks for this @dickoa.

Started fixing this - reinstall with devtools::install_github("ropensci/crul") and try again. I probably still need to make a few tweaks, but it should work for your example at least.

Wow, it works perfectly! Thanks again for crul.