Data from Public Bicycle Hire Systems

Author: Mark Padgham

A new rOpenSci package provides access to data to which users may already have directly contributed, and for which contribution is fun, keeps you fit, and helps make the world a better place. The data come from using public bicycle hire schemes, and the package is called bikedata. Public bicycle hire systems operate in many cities throughout the world, and most systems collect (generally anonymous) data, minimally consisting of the times and locations at which every single bicycle trip starts and ends. The bikedata package provides access to data from all cities which openly publish these data, currently including London, U.K., and in the U.S.A., New York, Los Angeles, Philadelphia, Chicago, Boston, and Washington DC. The package will expand as more cities openly publish their data (with the recently and enormously expanded San Francisco system next on the list).

…


Read the rest at https://ropensci.org/blog/blog/2017/10/17/bikedata

This looks like a really cool dataset, but given it’s individuals’ spatial trip data, I’d like to know whether the package maintainers or the constituent dataset providers do much to deidentify the trips beyond just taking obvious identifiers like name and DOB off the records. Are the trip start and end points fuzzed at all?

Oops! Should’ve read the package README a bit more closely—it just exposes aggregate numbers of trips between stations. Sick! :smiley:


No anonymization is done because there are no names or DOBs. The only personal detail is from those systems which record user-provided year of birth and gender, which the users may provide as they wish.

Note that the individual trips remain stored in an SQLite database throughout, and may be accessed with standard database calls:

> db <- dplyr::src_sqlite(bikedb, create = FALSE)
> dplyr::collect (dplyr::tbl (db, "trips"))
# A tibble: 1,175,305 x 11
      id  city trip_duration          start_time           stop_time start_station_id end_station_id bike_id user_type birth_year gender
   <int> <chr>         <int>               <chr>               <chr>            <chr>          <chr>   <chr>     <chr>      <chr>  <chr>
 1     1    ph           660 2017-01-01 00:05:00 2017-01-01 00:16:00           ph3046         ph3041                 1       <NA>   <NA>
 2     2    ph          2160 2017-01-01 00:21:00 2017-01-01 00:57:00           ph3110         ph3054                 0       <NA>   <NA>
 3     3    ph          2100 2017-01-01 00:22:00 2017-01-01 00:57:00           ph3110         ph3054                 0       <NA>   <NA>
 4     4    ph           720 2017-01-01 00:27:00 2017-01-01 00:39:00           ph3041         ph3005                 1       <NA>   <NA>
 5     5    ph           480 2017-01-01 00:28:00 2017-01-01 00:36:00           ph3047         ph3124                 0       <NA>   <NA>
 6     6    ph           420 2017-01-01 00:29:00 2017-01-01 00:36:00           ph3047         ph3124                 0       <NA>   <NA>
 7     7    ph           540 2017-01-01 00:31:00 2017-01-01 00:40:00           ph3072         ph3068                 1       <NA>   <NA>
 8     8    ph           960 2017-01-01 00:34:00 2017-01-01 00:50:00           ph3033         ph3114                 1       <NA>   <NA>
 9     9    ph          1140 2017-01-01 00:38:00 2017-01-01 00:57:00           ph3013         ph3028                 0       <NA>   <NA>
10    10    ph          1020 2017-01-01 00:40:00 2017-01-01 00:57:00           ph3013         ph3028                 0       <NA>   <NA>
# ... with 1,175,295 more rows

(for example). In this case, the only individual data are membership categories of 0 (not a system member) and 1 (system member), but other cities will have non-NA values for the other variables too.
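As a rough sketch of what "standard database calls" can look like without dplyr, the same kind of table can be queried with plain SQL through the DBI and RSQLite packages (assumed to be installed). To keep the example self-contained it builds a small in-memory stand-in for the "trips" table, using a few of the columns and rows printed above; with a real bikedata database you would connect to the `bikedb` file instead. The result column name `numtrips` is just an illustrative choice.

```r
library(DBI)

# Toy stand-in for the bikedata database: an in-memory SQLite database with a
# "trips" table holding a subset of the columns and rows shown above.
# With a real database, connect with dbConnect(RSQLite::SQLite(), bikedb).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
trips <- data.frame(
  id = 1:5,
  city = "ph",
  trip_duration = c(660L, 2160L, 2100L, 720L, 480L),
  start_station_id = c("ph3046", "ph3110", "ph3110", "ph3041", "ph3047"),
  end_station_id = c("ph3041", "ph3054", "ph3054", "ph3005", "ph3124"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "trips", trips)

# Count the trips between each pair of stations with plain SQL.
trip_counts <- dbGetQuery(con, "
  SELECT start_station_id, end_station_id, COUNT(*) AS numtrips
  FROM trips
  GROUP BY start_station_id, end_station_id
  ORDER BY numtrips DESC, start_station_id
")
print(trip_counts)

dbDisconnect(con)
```

Running the aggregation as SQL means only the summarised counts are pulled into R, which matters more for a table of a million-plus rows than for this toy one.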

I apologize for the completely unrelated question, but… what font are you using? I can see the font supports ligatures!