Collecting tidy data

If you’re manually collecting data, how do you set yourself up to reduce the amount of time spent cleaning/tidying it afterwards?

Sometimes I find it difficult to imagine what my data would look like in its tidy state, e.g. I’ll have an ID variable, then anywhere between 10 and 50 variables where each variable has between 1 and 10 observations per ID. The manual input feels incredibly messy and error-prone in itself. Add a few extra collaborators and you’ve got a recipe for disaster. Do you collect data in whatever way seems most natural for your current project, or do you keep it tidy from the beginning? How?

Google Sheets and Excel are “good” options - but not very safe in my opinion (I’m still living in a world where Excel files and article feedback get emailed around with _namedate suffixes). I could use Google Sheets and the googlesheets package to get some level of version control, but I’d prefer something like REDCap. REDCap, however, seems to require a lot of resources and I’m not sure we’ll be able to use it. Are there any other options that work well with R? Like a lightweight, easier-to-use version of REDCap with an R API…?

Edit: As pointed out by @rywhale, this post is a cross-post from RStudio Community; original post here.

@bragks just a heads-up that it’s standard practice around here to flag any cross-posts and link the original post, so that folks don’t put work into an answer that someone else has already provided on the other forum.

Hoping for some good suggestions here though.

There are lots of options in this space that vary in the degree of effort, customization, and specificity to R-based analysis. For those collecting data in the field, Open Data Kit (ODK) and the many services built on top of it are common. ODK requires setting up your own server, but many of the services, like KoBoToolbox, do not. We use ODK for one project and a non-ODK but similar hosted service, EpiCollect, in another.

If you are entering data at a desk, most of these applications are overkill, and their form-based interfaces aren’t ideal. Most people favor some sort of tabular entry in this case. For this, I like Airtable a lot. It provides a familiar, spreadsheet-like interface in which you can tightly control the types of data entered, and multiple people can use it simultaneously. There’s an R package to interface with it. It has some competitors, but I haven’t explored them extensively. You can also get some of the same functionality through extensive use of user controls and data validation in Google Sheets. I don’t know of any self-hosted or open-source equivalents, but Airtable has a free plan and you can dump your data at any time.

One note is that “tidy” or “long” data is a pretty poor way to enter data. In my experience most people are much more comfortable entering data in “wide” form. But well-designed wide data entry is easy to transform to long form when needed.
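
As a hedged illustration of that transform, here’s a minimal tidyr sketch; the columns (`id`, `visit_1` through `visit_3`) are invented for the example:

```r
library(tidyr)

# Data as it might be entered: one row per ID, one column per visit
wide <- data.frame(
  id      = c("A01", "A02"),
  visit_1 = c(5.2, 4.8),
  visit_2 = c(5.5, 4.9),
  visit_3 = c(5.9, 5.1)
)

# Reshape to long/tidy form: one row per ID-visit combination
long <- pivot_longer(
  wide,
  cols      = starts_with("visit_"),
  names_to  = "visit",
  values_to = "measurement"
)
```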

Finally, this paper from the Portal Project may be of interest. It describes their collaborative QA/QC workflow for ongoing data entry on a large project.


Good point, thank you!

The Portal Project mentions using git and Excel with data validation for data entry, which seems like a fairly good approach to me, but my field is medicine - getting a bunch of my colleagues to use and understand git is not something I’d feel particularly compelled to do… :neutral_face: The other options you mention look like great tools, but for my needs they’re a bit too “advanced”. Mostly I’d just want to enter integers/doubles, factors, strings and dates.

I’ve also considered using Google Sheets with the googlesheets package, but most likely I won’t be able to, due to GDPR (and ethical reasons).

Could some sort of SQL database be what I’m looking for? I have no knowledge of SQL, but if there’s an easy way to set up a report-form interface (a website with login), I think it might be worth learning the necessary basics.

Summarised, for my field of research, the optimal workflow would look something like this:

Customisable but simple data entry interface ->
Database (preferably local, unless very secure and somehow approved under GDPR) with an API to R ->
Import to R ->
Prewritten assertr script so I can hunt people down when they mess up (see the sketch below)
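
For that last step, a minimal assertr sketch might look like this (the columns `id`, `age` and `sex` are invented for illustration; by default a failed check stops with an error listing the offending rows):

```r
library(assertr)
library(magrittr)  # for %>%

check_entries <- function(df) {
  df %>%
    verify(has_all_names("id", "age", "sex")) %>%  # expected columns present
    assert(not_na, id) %>%                         # every record needs an ID
    assert(within_bounds(0, 120), age) %>%         # plausible ages only
    assert(in_set("M", "F"), sex)                  # agreed factor levels only
}
```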

A SQL database with a web front-end could do what you’re looking for. I’ve done this in the past for a project that spanned several countries, where partners often didn’t have mobile reception in the rice field. So they collected data and entered it into a portal when they were done. The database was Postgres, as I recall. My PhD student wrote an R package to access the database and work with the data in R from there. I had to hire a full-stack web developer to build that for me and deploy it on Amazon Web Services. Security is definitely not something you want to second-guess, so you need someone who knows what they’re doing in that area.
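
That R package was project-specific, but the generic pattern for this kind of access with DBI and RPostgres looks roughly like the following; the database name, host and table are placeholders:

```r
library(DBI)
library(RPostgres)

# Credentials kept out of the script via environment variables
con <- dbConnect(
  RPostgres::Postgres(),
  dbname   = "fielddata",
  host     = "db.example.org",
  user     = Sys.getenv("DB_USER"),
  password = Sys.getenv("DB_PASS")
)

measurements <- dbReadTable(con, "measurements")
dbDisconnect(con)
```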

Just adding to what Noam said, another option is Fulcrum, https://www.fulcrumapp.com/. I’ve used ODK for a large, country-wide project in the Philippines and am getting set up to use Fulcrum here in Oz. ODK is free, but as pointed out, you have to set up your own server, and it only works on Android. Fulcrum does the server work for you and works on both iOS and Android, but you’ll pay for it. Currently I’m only setting up the forms to be used. Very flexible and easy to use. Data ends up in the typical wide format, but that’s easy to tidy up in R.


I’d be really interested to see an example of someone making a data entry front-end to a database that has a familiar spreadsheet-like interface with type restrictions, along the lines of Airtable (or rowshare.com, which I just discovered). There are lots of form-based solutions out there, self-hosted or otherwise, ODK-based or not, but in my experience they are never any good for bulk data entry. A big project I work on has a series of Excel templates that can be uploaded and automatically ingested into a portal backed by an SQL database - this was implemented after many staff rebelled at the inefficiency of a form-based UI.
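
That portal is custom, but the ingestion step it describes can be sketched generically with readxl plus DBI; the file name, table name and SQLite stand-in backend below are all placeholders:

```r
library(readxl)
library(DBI)
library(RSQLite)  # stand-in backend; the real portal sits on an SQL server

# Read a filled-in template, then append it to the database table
batch <- read_excel("template.xlsx", sheet = 1)

con <- dbConnect(RSQLite::SQLite(), "portal.sqlite")
dbWriteTable(con, "entries", batch, append = TRUE)
dbDisconnect(con)
```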


I’m quite surprised there’s no catch-all solution for this already. Surely there are lots of people doing bulk data entry on a daily basis; they can’t all be using Excel? Feels like this could’ve been the perfect package: help you deploy a database and configure your front-end interface through some sort of rmarkdown/blogdown solution. If only I had the time, and the skills…

Setting up a database and a front-end entry portal via shiny (or Node.js, or plain JavaScript, or whatever) is easy enough. The hard part is making that front end accessible to the people you want to fill in the data without having a dedicated server (aka paying for it)… The best R-native solution I can think of is a shiny server setup (or shinyapps.io).
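
To make that concrete, a toy shiny entry form might look like the sketch below; appending to a local CSV stands in for a real database write, and the field names are invented:

```r
library(shiny)

ui <- fluidPage(
  textInput("id", "Participant ID"),
  numericInput("age", "Age", value = NA, min = 0, max = 120),
  selectInput("sex", "Sex", choices = c("M", "F")),
  dateInput("visit", "Visit date"),
  actionButton("save", "Save record")
)

server <- function(input, output, session) {
  observeEvent(input$save, {
    rec <- data.frame(
      id    = input$id,
      age   = input$age,
      sex   = input$sex,
      visit = as.character(input$visit)
    )
    # Append the record; write the header only for the first entry
    write.table(rec, "entries.csv", sep = ",", append = TRUE,
                col.names = !file.exists("entries.csv"), row.names = FALSE)
  })
}

shinyApp(ui, server)
```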

Like @noamross said, there are lots of form-based entry methods (e.g. Google Forms --> Google Sheets --> R via the googlesheets package). This gets super annoying, though, if you want to customize/autofill answers or have any sort of dynamic elements.
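
For reference, the read step of that pipeline with googlesheets is short; the sheet name “Form responses” is a placeholder for whatever your form writes to:

```r
library(googlesheets)

gs_auth()                          # interactive OAuth on first use
ss <- gs_title("Form responses")   # register the sheet by its title
responses <- gs_read(ss)           # read it into a data frame
```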

I’d be more than willing to help out with any project that tries to address this issue, but I don’t think it’s going to be easy…

A project which may be interesting to follow is goodtables.io. It allows continuous, integrated testing of the tidiness of your tables.