Hi folks,
I’ve never really figured out the right/robust way to deal with non-ASCII data in a way that keeps R happy (R CMD check in particular).
I think what I often want to do is simply replace the non-ASCII characters with their nearest ASCII equivalents, much like this example: http://stackoverflow.com/questions/10704661/replace-non-ascii-chars-with-a-defined-string-list-without-a-loop-in-r
One might think a comprehensive map of this sort already existed, and this sounds like what the iconv() function is supposed to do; but as far as I can tell from my experiments it doesn’t: at least for me it just gives NAs.
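(One thing I did stumble on: appending "//TRANSLIT" to the target encoding asks iconv() for nearest-equivalent substitution rather than a straight conversion — but the result is implementation-dependent. glibc transliterates; other iconv builds may give "?" or NA instead, which may be exactly what I was hitting.)

```r
# "//TRANSLIT" requests nearest-ASCII substitutes for unconvertible
# characters; behaviour depends on the system's iconv implementation
# (glibc transliterates; other platforms may return "?" or NA)
iconv("Gonz\u00e1lez caf\u00e9", from = "UTF-8", to = "ASCII//TRANSLIT")
```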
I’m wondering if anyone has already written a generic function like this that just strips the accent marks from non-ASCII characters. I’ve written my own helper functions, based on that example, to do this:
find_non_ascii <- function(string) {
  # indices of elements containing non-ASCII characters: iconv()
  # substitutes a marker for anything unconvertible, grep() finds it
  grep("I_WAS_NOT_ASCII",
       iconv(string, "latin1", "ASCII", sub = "I_WAS_NOT_ASCII"))
}

replace_non_ascii <- function(string) {
  i <- find_non_ascii(string)
  # character-for-character map; chartr() is vectorised, so no sapply() needed
  non_ascii <- "áéíóúÁÉÍÓÚñÑüÜ’åôö"
  ascii     <- "aeiouAEIOUnNuU'aoo"
  string[i] <- chartr(non_ascii, ascii, string[i])
  string
}
# e.g. replace_non_ascii(c("González", "señor"))  ## "Gonzalez" "senor"
Certainly this could be improved upon; my list isn’t comprehensive, it just covers the characters I’ve hit so far. Perhaps this would be of more general use if it lived in devtools or somewhere similar? I imagine it would primarily be handy in data-raw scripts. Thoughts?
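(One alternative I haven’t fully explored, for anyone who doesn’t want to maintain the character table by hand: the stringi package wraps ICU’s transliterator, whose "Latin-ASCII" rule set already carries a comprehensive accent-stripping map. A sketch, assuming stringi is installed:)

```r
# assumes the stringi package is installed; ICU's "Latin-ASCII"
# transform maps accented Latin characters to nearest-ASCII equivalents
library(stringi)
stri_trans_general(c("González", "œuvre", "Ängström"), "Latin-ASCII")
# "Gonzalez" "oeuvre" "Angstrom"
```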