CSV editor that won't do anything to the data?


#1

In a fit of pique I asked on rOpenSci slack:

Any recommendations for a CSV editor (or spreadsheet program generally) that won’t do anything to the data? I’m at my wits’ end with Excel turning 1:200 into 0.180555 and gene names into dates. But sometimes I still need to manually edit files from collaborators.

And I got a bounty of suggestions from the helpful rOpenSci community. In this question I’ll give my impression of the programs. Further suggestions and feedback are welcome.

What I want

I want to be able to open and edit Excel and CSV files and edit them freely without having any conversions done. Particular problems are conversions of text like 1:200, 1/25, and MARCH1 into numbers or dates. This is a well-known problem.

Yes, yes, in a perfect world I would make all changes to the data with code. But sometimes I receive files from collaborators that are so idiosyncratically formatted that it’s just not worth the effort to clean them up in code.

Testing system

To try out the various solutions, I created a small test.csv containing a bunch of entries that I know have caused problems in the past: gene names that look like dates (“MARCH1”), gene identifiers that look like numbers (“2310009E13”), a date, and a variety of numbers that tend to get reformatted. So my first criterion was that the ideal program would change nothing about any entry in this file.

This is what the raw file looked like:
vim

My second criterion was that I should easily be able to edit the file in question. I tested this by opening an XLSX file from a collaborator and making all the changes it needed.

With those criteria established, I went ahead and checked out the recommended programs, in three categories: spreadsheet programs, CSV-focused solutions, and other solutions.

Category 1: Spreadsheet programs

Spreadsheet programs are the obvious choice for editing CSV files, but they are also what led me here in the first place.

I tried this out with Excel, Google Sheets, and LibreOffice Calc, with similar results across all three. Each mangled parts of my test file, but each in its own way.

Google sheets:

LibreOffice calc:

On the positive side, it was easy to edit other-peoples-data.xlsx in all of these programs, though they did all mangle the data I entered when it was something like 1.00 or 1:200.

One possible solution to the issues of data conversion with spreadsheet programs is, on import, to tell the program to interpret all columns as text. And indeed that works fine to prevent data mangling, as shown below:

Google sheets, no conversion:

LibreOffice calc, no conversion:

But even though this could avoid mangling the data on import, data that I typed in would be reformatted. I was disappointed that this solution didn’t work, because honestly I’d like to keep using what I’ve been using with a tweak to the settings.

CSV-focused editors

Comma Chameleon brags on its website about how little it lets you do to your files: just open, add and remove rows/columns, and save. It has some cool other features, like data validation and data package creation, that I won’t cover here.

As advertised, Comma Chameleon worked well: it opened my test file without problems and worked fine for editing other-peoples-data.xlsx

Jenny Bryan and Noam Ross on rOpenSci slack reported having tried Comma Chameleon about a year ago and having some problems and finding it a bit clunky, but being really glad that the developers were making it. In the year since they tried it, a bunch of progress seems to have been made (based on the github page). It looks like like some of the issues Jenny and Noam reported have been resolved. It mostly just worked for me, so it’s definitely worth a try.

Tad is another CSV-focused program, particularly designed for fast viewing/sorting/filtering of large csv files. It opened my test file without problems:

But it is read-only (and won’t open XLSX files) so wouldn’t work for my editing needs. I could imagine using it for viewing files with thousands of rows though.

Other programs

A bunch of other programs were suggested that don’t neatly fit into one category, so I’m lumping them together here.

Stencila seems like an interesting program, calling itself “the office suite for reproducible research”. It seems to be in early stages of development and, as far as I can tell, doesn’t yet have a way to open CSV or XLSX files. But I really like the idea of an office suite designed with reproducible research in mind.

Two people suggested using your favorite text editor with a CSV plugin. In particular, the tablr plugin for the Atom text editor and csv-mode in Emacs were recommended. The tablr demo looks pretty slick. I don’t have any experience with Atom or emacs, so I didn’t try either of these, but I think they might work pretty well for the right person.

Airtable, a cloud-based database tool consisting of tables you can edit, was also mentioned and seems like a cool and useful program, but the idea of uploading and downloading a file every time was too much overhead for me.

Daff (data diff) also got a mention. While it doesn’t solve the particular problem I have here, it’s really cool! It is an improved diff for CSV files that understands rows and columns, not just lines, so you can more easily see what actually changed in the file.

Conclusions

I’m going with Comma Chameleon. It’s easy to use and doesn’t mess up my data.

I think that a CSV plugin for your favorite text editor would probably work well for a lot of people, just not for me.

Thank you

Suggestions and ideas came in a torrent from a bunch of great rOpenSci people. Thank you to Bryce Mecum, Stefanie Butland, Scott Chamberlain, Jenny Bryan, Noam Ross, Jennifer Thompson, Ben Raymond, and Carl Boettiger.


#2

Great question and even better writeup!


#3

Great question an great write-up; will definitely point people to this thread next time this discussion comes up.

Does have the unfortunate side-effect of making me really want something like Comma Chameleon as an RStudio extension. (and no, utils::edit() doesn’t count)…


#4

for what it is worth, we just reviewed this internally and Comma Chameleon removed some potentially “unnecessary” quotes from our existing CSV that are useful for dealing with data we import into Impala. Also, it replaced all of the line endings with DOS line endings, which may mess with CSV parsers under Unix. Just keep that in mind if it looks like it affects you. Otherwise it seemed fine.


#5

What I do use use excel. Let me explain.

Make a copy of your csv but make it a txt file, then open it with the text import and make all fields text. Then it will not mangle your data. When your done do a save as and choose csv.