A new rOpenSci package provides access to data to which users may already have directly contributed, and for which contribution is fun, keeps you fit, and helps make the world a better place. The data come from using public bicycle hire schemes, and the package is called bikedata. Public bicycle hire systems operate in many cities throughout the world, and most systems collect (generally anonymous) data, minimally consisting of the times and locations at which every single bicycle trip starts and ends. The bikedata package provides access to data from all cities which openly publish these data, currently including London, U.K., and in the U.S.A., New York, Los Angeles, Philadelphia, Chicago, Boston, and Washington DC. The package will expand as more cities openly publish their data (with the recently and enormously expanded San Francisco system next on the list).
This looks like a really cool dataset, but given it’s individuals’ spatial trip data, I’d like to know whether the package maintainers or the constituent dataset providers do much to deidentify the trips beyond just taking obvious identifiers like name and DOB off the records. Are the trip start and end points fuzzed at all?
No anonymization is done because there are no names or DOBs to remove. The only personal details are the year of birth and gender recorded by some systems, and users of those systems provide them only if they wish.
Note that these individual trips are stored the whole time in an SQLite database, and may be accessed with standard database calls
(for example). In this case, the only individual data are membership categories of 0 (not a system member) and 1 (system member), but other cities will have non-NA values for the other variables too.
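As a rough illustration of what "standard database calls" against an SQLite store can look like, here is a minimal sketch using Python's built-in `sqlite3` module. It does not touch the real bikedata database: the `trips` table, its column names, the city codes `aa` and `bb`, and every row are made up for illustration, and the package's actual schema may well differ. The helper `personal_fields` is likewise hypothetical.

```python
import sqlite3

# Hypothetical stand-in for a trip database. Table and column names are
# assumptions for illustration only, not the package's actual schema.
con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE trips (
        id INTEGER PRIMARY KEY,
        city TEXT,
        start_time TEXT,
        stop_time TEXT,
        start_station_id TEXT,
        end_station_id TEXT,
        user_type INTEGER,   -- 0 = not a system member, 1 = system member
        birth_year INTEGER,  -- NULL where the system does not record it
        gender INTEGER       -- NULL where the system does not record it
    )
    """
)

# Made-up rows: city "aa" records only membership, city "bb" also records
# user-provided year of birth and gender (the gender code is arbitrary).
rows = [
    ("aa", "2017-01-01 08:00:00", "2017-01-01 08:15:00", "s1", "s2", 0, None, None),
    ("aa", "2017-01-01 09:00:00", "2017-01-01 09:20:00", "s2", "s3", 1, None, None),
    ("bb", "2017-01-02 10:00:00", "2017-01-02 10:05:00", "t1", "t2", 1, 1985, 2),
]
con.executemany(
    "INSERT INTO trips (city, start_time, stop_time, start_station_id, "
    "end_station_id, user_type, birth_year, gender) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    rows,
)


def personal_fields(con, city):
    """Return (user_type, birth_year, gender) for every trip in one city."""
    cur = con.execute(
        "SELECT user_type, birth_year, gender FROM trips "
        "WHERE city = ? ORDER BY id",
        (city,),
    )
    return cur.fetchall()


for city in ("aa", "bb"):
    print(city, personal_fields(con, city))
```

In this sketch, SQL `NULL` (Python `None`) plays the role of the `NA` values mentioned above: the first made-up city yields only the 0/1 membership category per trip, while the second also carries a year of birth and gender.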