In a fit of pique I asked on rOpenSci slack:
Any recommendations for a CSV editor (or spreadsheet program generally) that won’t do anything to the data? I’m at my wits’ end with Excel turning 1:200 into 0.180555 and gene names into dates. But sometimes I still need to manually edit files from collaborators.
And I got a bounty of suggestions from the helpful rOpenSci community. In this question I’ll give my impression of the programs. Further suggestions and feedback are welcome.
What I want
I want to be able to open and edit Excel and CSV files and edit them freely without having any conversions done. Particular problems are conversions of text like 1:200, 1/25, and MARCH1 into numbers or dates. This is a well-known problem.
Yes, yes, in a perfect world I would make all changes to the data with code. But sometimes I receive files from collaborators that are so idiosyncratically formatted that it’s just not worth the effort to clean them up in code.
To try out the various solutions, I created a small
test.csv containing a bunch of entries that I know have caused problems in the past: gene names that look like dates (“MARCH1”), gene identifiers that look like numbers (“2310009E13”), a date, and a variety of numbers that tend to get reformatted. So my first criterion was that the ideal program would change nothing about any entry in this file.
This is what the raw file looked like:
My second criterion was that I should easily be able to edit the file in question. I tested this by opening an XLSX file from a collaborator and making all the changes it needed.
With those criteria established, I went ahead and checked out the recommended programs, in three categories: spreadsheet programs, CSV-focused solutions, and other solutions.
Category 1: Spreadsheet programs
Spreadsheet programs are the obvious choice for editing CSV files, but they are also what led me here in the first place.
I tried this out with Excel, Google Sheets, and LibreOffice Calc, with similar results across all three. Each mangled parts of my test file, but each in its own way.
On the positive side, it was easy to edit
other-peoples-data.xlsx in all of these programs, though they did all mangle the data I entered when it was something like 1.00 or 1:200.
One possible solution to the issues of data conversion with spreadsheet programs is, on import, to tell the program to interpret all columns as text. And indeed that works fine to prevent data mangling, as shown below:
Google sheets, no conversion:
LibreOffice calc, no conversion:
But even though this could avoid mangling the data on import, data that I typed in would be reformatted. I was disappointed that this solution didn’t work, because honestly I’d like to keep using what I’ve been using with a tweak to the settings.
Comma Chameleon brags on its website about how little it lets you do to your files: just open, add and remove rows/columns, and save. It has some cool other features, like data validation and data package creation, that I won’t cover here.
As advertised, Comma Chameleon worked well: it opened my test file without problems and worked fine for editing
Jenny Bryan and Noam Ross on rOpenSci slack reported having tried Comma Chameleon about a year ago and having some problems and finding it a bit clunky, but being really glad that the developers were making it. In the year since they tried it, a bunch of progress seems to have been made (based on the github page). It looks like like some of the issues Jenny and Noam reported have been resolved. It mostly just worked for me, so it’s definitely worth a try.
Tad is another CSV-focused program, particularly designed for fast viewing/sorting/filtering of large csv files. It opened my test file without problems:
But it is read-only (and won’t open XLSX files) so wouldn’t work for my editing needs. I could imagine using it for viewing files with thousands of rows though.
A bunch of other programs were suggested that don’t neatly fit into one category, so I’m lumping them together here.
Stencila seems like an interesting program, calling itself “the office suite for reproducible research”. It seems to be in early stages of development and, as far as I can tell, doesn’t yet have a way to open CSV or XLSX files. But I really like the idea of an office suite designed with reproducible research in mind.
Two people suggested using your favorite text editor with a CSV plugin. In particular, the tablr plugin for the Atom text editor and csv-mode in Emacs were recommended. The tablr demo looks pretty slick. I don’t have any experience with Atom or emacs, so I didn’t try either of these, but I think they might work pretty well for the right person.
Airtable, a cloud-based database tool consisting of tables you can edit, was also mentioned and seems like a cool and useful program, but the idea of uploading and downloading a file every time was too much overhead for me.
Daff (data diff) also got a mention. While it doesn’t solve the particular problem I have here, it’s really cool! It is an improved diff for CSV files that understands rows and columns, not just lines, so you can more easily see what actually changed in the file.
I’m going with Comma Chameleon. It’s easy to use and doesn’t mess up my data.
I think that a CSV plugin for your favorite text editor would probably work well for a lot of people, just not for me.
Suggestions and ideas came in a torrent from a bunch of great rOpenSci people. Thank you to Bryce Mecum, Stefanie Butland, Scott Chamberlain, Jenny Bryan, Noam Ross, Jennifer Thompson, Ben Raymond, and Carl Boettiger.