I help many users of rOpenSci packages that talk to web APIs. Many of these APIs paginate their results - that is, if an API allows at most 10 results per page, you have to make 10 requests to get all 100 results.
Some users are surprised to find that they don't get all the results at once. Many have questions about how to work with pagination, and would like it to be done for them.
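To make the pain point concrete, here's a minimal sketch of what handling pagination by hand looks like, using crul (the client discussed below) against a hypothetical Crossref-style API that accepts rows (page size) and offset query parameters:

library(crul)
cli <- HttpClient$new(url = "http://api.crossref.org")
# request 50 results, 10 at a time, shifting the offset for each request
results <- list()
for (offset in seq(0, 40, by = 10)) {
  res <- cli$get("works", query = list(rows = 10, offset = offset))
  results[[length(results) + 1]] <- res$parse("UTF-8")
}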
crul pagination
In the HTTP client (crul) that I maintain, I just added a new class: Paginator. You pass it an object of class HttpClient with connection details, and specify a few details about how pagination works with the specific API you're working with - then Paginator takes care of the rest.
This works only with synchronous requests for now, but in theory it could work for asynchronous requests too.
example
Install the latest development version from GitHub to get the pagination feature.
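A minimal sketch, assuming the package lives at ropensci/crul on GitHub and that you have devtools available:

# install the development version from GitHub
devtools::install_github("ropensci/crul")
library(crul)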
Set up connection details:
(cli <- HttpClient$new(url = "http://api.crossref.org"))
#> <crul connection>
#> url: http://api.crossref.org
#> curl options:
#> proxies:
#> auth:
#> headers:
Pass it to Paginator, and set the required details:

- by: method to do pagination by (only one option for now, query_params, which is also the default; see the docs for future options)
- limit_param: name of the limit parameter
- offset_param: name of the offset parameter
- limit: total number of results to get
- limit_chunk: number of results to get per page (chunk)
Creating the object doesn't perform the HTTP requests yet.
(cc <- Paginator$new(client = cli, by = "query_params", limit_param = "rows",
offset_param = "offset", limit = 50, limit_chunk = 10))
#> <crul paginator>
#> base url: http://api.crossref.org
#> by: query_params
#> limit_chunk: 10
#> limit_param: rows
#> offset_param: offset
#> limit: 50
#> status: not run yet
Now call the HTTP method you want - here, GET via get():
cc$get('works')
#> OK
The object's status is now updated with the number of requests done - with limit = 50 and limit_chunk = 10, that's 5 requests:
cc
#> <crul paginator>
#> base url: http://api.crossref.org
#> by: query_params
#> limit_chunk: 10
#> limit_param: rows
#> offset_param: offset
#> limit: 50
#> status: 5 requests done
Other methods to use:
# get all responses
cc$responses()
# get all HTTP status objects
cc$status()
# get all HTTP status codes
cc$status_code()
# get all times
cc$times()
# get all raw bytes
cc$content()
# parse all responses
cc$parse()
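From here you'd typically process the parsed pages yourself. A minimal sketch, assuming the API returned JSON and that you have the jsonlite package available (jsonlite is an assumption here, not part of crul):

# parse each page's JSON string into an R list
# one element per request made - five in this example
pages <- lapply(cc$parse(), jsonlite::fromJSON)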
thoughts?
Would love any feedback on this - there are bound to be lots of edge cases because there's no one way to build a web API.