(Generic function/package for) Mapping non-ASCII characters to nearest ASCII versions?

Hi folks,

I’ve never really figured out the right/robust way to deal with non-ASCII data in a way that makes R happy (R CMD check in particular).

I think what I often want to do is simply replace the non-ascii characters with their nearest ASCII equivalents, e.g. very much like this example: http://stackoverflow.com/questions/10704661/replace-non-ascii-chars-with-a-defined-string-list-without-a-loop-in-r

One might think that a comprehensive map of this sort existed, and this sounds like what the “iconv()” function is supposed to do; but as far as I can tell from my experiments it doesn’t; at least for me it just gives NAs.
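For example, on my machine (a UTF-8 locale; the exact behaviour may differ on other platforms), a plain conversion to ASCII just gives NA as soon as anything is non-convertible:

iconv("café", from = "UTF-8", to = "ASCII")
[1] NA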

I’m wondering if anyone has already written a generic function like this that could just strip off accent marks from the non-ascii characters. I’ve just written my own helper function based on that example to do this:

find_non_ascii <- function(string) {
  # indices of elements containing at least one non-ASCII character:
  # iconv() replaces anything it can't convert with the marker string
  grep("I_WAS_NOT_ASCII",
       iconv(string, "latin1", "ASCII", sub = "I_WAS_NOT_ASCII"))
}

replace_non_ascii <- function(string) {
  i <- find_non_ascii(string)
  non_ascii <- "áéíóúÁÉÍÓÚñÑüÜ’åôö"
  ascii <- "aeiouAEIOUnNuU'aoo"
  # chartr() is vectorised, so no sapply() loop is needed
  string[i] <- chartr(non_ascii, ascii, string[i])
  string
}
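For example, on my machine (assuming UTF-8 input; the encoding handling is the part I’m least sure is portable):

replace_non_ascii(c("café", "Sjögren's"))
[1] "cafe"      "Sjogren's"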

Certainly this could be improved upon; I don’t think my list is comprehensive, as it just covers the characters I’ve hit so far. Perhaps this would be of more general use if it were in devtools or something? I imagine it would primarily be handy in data-raw scripts. Thoughts?

I have some functions in testdat to deal with this; see ropensci/testdat.

But this sort of minor annoyance was part of my motivation for that package as a whole, which might get merged with another parallel piece of work: a set of functions to both test and fix.

Hi Karthik,

Sounds good. Maybe testdat would be a good home for these functions then?

It looks to me like the current functions in testdat (e.g. test_utf8 and sanitize_text) aren’t actually doing any mapping, though; they just drop the characters entirely.

Correct. But the goal is to eventually fix those. For now we needed a way to proceed with data analysis without getting bogged down. I plan to merge this functionality into a non-rOpenSci package (but have that author port it over), and that would be the place to add it. In the meantime, please feel free to send a PR to testdat.

Is Richard Iannone’s UnidecodeR what you’re looking for?


Nice, that looks like what Carl had in mind.

Bingo! That’s awesome. Many thanks!

It doesn’t cover some common Unicode characters that aren’t accents (e.g. curly quotation marks) and has some performance issues, but I think those are easily fixed.
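For the curly quotes at least, a base-R fix-up along these lines should do it (just a sketch, and fix_curly_quotes is a made-up name here; the \u escapes are the code points for the four common curly quote characters):

fix_curly_quotes <- function(x) {
  # map curly single/double quotes to their straight ASCII equivalents
  chartr("\u2018\u2019\u201c\u201d", "''\"\"", x)
}

fix_curly_quotes("\u2018hi\u2019 she said \u201cloudly\u201d")
[1] "'hi' she said \"loudly\""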

time to fork…

:clock12: 2 :fork_and_knife:

Funny how these problems keep coming back: http://stackoverflow.com/questions/18123501/replacing-accented-characters-with-plain-ascii-ones/

I think you can do this in iconv if you translate to something like “ascii/transliterate”. I can’t remember the exact option.

Edit: Here’s an SO answer about it: http://stackoverflow.com/a/13610611/2338862 that includes the following example:

iconv(c("über","Sjögren's"), to="ASCII//TRANSLIT")
[1] "uber"      "Sjogren's"

Damn, totally missed this awesome thread. I never seem to get any discuss notifications. :frowning: