(Generic function/package for) Mapping non-ASCII characters to nearest ASCII versions?

Hi folks,

I’ve never really figured out the right/robust way to deal with non-ASCII data in a way that makes R happy (R CMD check in particular).

I think what I often want to do is simply replace the non-ascii characters with their nearest ASCII equivalents, e.g. very much like this example: http://stackoverflow.com/questions/10704661/replace-non-ascii-chars-with-a-defined-string-list-without-a-loop-in-r

One might think that a comprehensive map of this sort existed, and this sounds like what the “iconv()” function is supposed to do; but as far as I can tell from my experiments it doesn’t; at least for me it just gives NAs.
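For example, on my machine (a UTF-8 locale; the exact behaviour may differ on other platforms), a plain conversion to ASCII just gives NA as soon as anything is non-convertible:

iconv("café", from = "UTF-8", to = "ASCII")
[1] NA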

I’m wondering if anyone has already written a generic function like this that could just strip off accent marks from the non-ascii characters. I’ve just written my own helper function based on that example to do this:

find_non_ascii <- function(string) {
  # indices of elements containing at least one non-ASCII character:
  # iconv() replaces anything it can't convert with the marker string
  grep("I_WAS_NOT_ASCII",
       iconv(string, "latin1", "ASCII", sub = "I_WAS_NOT_ASCII"))
}

replace_non_ascii <- function(string) {
  i <- find_non_ascii(string)
  non_ascii <- "áéíóúÁÉÍÓÚñÑüÜ’åôö"
  ascii <- "aeiouAEIOUnNuU'aoo"
  # chartr() is vectorised, so no sapply() loop is needed
  string[i] <- chartr(non_ascii, ascii, string[i])
  string
}
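For example, on my machine (assuming UTF-8 input; the encoding handling is the part I’m least sure is portable):

replace_non_ascii(c("café", "Sjögren's"))
[1] "cafe"      "Sjogren's"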

Certainly this could be improved upon; I don’t think my list is comprehensive, as it just covers the characters I’ve hit so far. Perhaps this would be of more general use if it were in devtools or something? I imagine it would primarily be handy in data-raw scripts. Thoughts?

I have some functions in testdat to deal with this; see ropensci/testdat.

But this sort of minor annoyance was part of my motivation for that package as a whole, which might get merged with another parallel piece of work: a set of functions to both test and fix.

Hi Karthik,

Sounds good. Maybe testdat would be a good home for these functions then?

It looks to me like the current functions in testdat (e.g. test_utf8 and sanitize_text) aren’t actually doing any mapping, though; they just drop the characters entirely.

Correct. But the goal is to eventually fix those. For now we needed a way to proceed with data analysis without getting bogged down. I plan to merge this functionality into a non-rOpenSci package (but have that author port it over), and that would be the place to add it. In the meantime, please feel free to send a PR to testdat.

Is Richard Iannone’s UnidecodeR what you’re looking for?


Nice, that looks like what Carl had in mind.

Bingo! That’s awesome. Many thanks!

It doesn’t cover some common Unicode characters that aren’t accents (e.g. curly quotation marks) and has some performance issues, but I think those are easily fixed.
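For the curly quotes at least, a base-R fix-up along these lines should do it (just a sketch, and fix_curly_quotes is a made-up name here; the \u escapes are the code points for the four common curly quote characters):

fix_curly_quotes <- function(x) {
  # map curly single/double quotes to their straight ASCII equivalents
  chartr("\u2018\u2019\u201c\u201d", "''\"\"", x)
}

fix_curly_quotes("\u2018hi\u2019 she said \u201cloudly\u201d")
[1] "'hi' she said \"loudly\""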

time to fork…

:clock12: 2 :fork_and_knife:

Funny how these problems keep coming back: http://stackoverflow.com/questions/18123501/replacing-accented-characters-with-plain-ascii-ones/

I think you can do this in iconv if you translate to something like “ascii/transliterate”. I can’t remember the exact option.

Edit: Here’s an SO answer about it: http://stackoverflow.com/a/13610611/2338862 that includes the following example:

iconv(c("über","Sjögren's"), to="ASCII//TRANSLIT")
[1] "uber"      "Sjogren's"

Damn, totally missed this awesome thread. I never seem to get any discuss notifications. :frowning: