NOTE on UTF-8 strings by `goodpractice::gp()`

While checking the package taxlist using goodpractice::gp() and rhub::check(platform = "debian-clang-devel") I’m getting a NOTE:

  • checking data for non-ASCII characters … NOTE
    Note: found 137 marked UTF-8 strings

This is clearly due to special characters in the data set Easplist (Easplist@taxonNames$AuthorName). Although I don’t get this note with devtools::check_built() under my local settings, and I assume it is not a major issue on other platforms, I’m still struggling to reach a final opinion on two points:

  1. Is there a way to get rid of this note or should I just ignore it?
  2. If the encoding of the package is declared as UTF-8, why are UTF-8 strings still flagged by the check?

This issue is under discussion here and here.

  1. Is there a way to get rid of this note or should I just ignore it?

You can always edit your data frame. IIRC, I’ve done that once using stri_enc_toascii() from the stringi package.
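As a minimal sketch (with a hypothetical data frame standing in for the package dataset), the conversion could look like this — note that stri_enc_toascii() substitutes non-ASCII characters rather than transliterating them, so information is lost:

```r
library(stringi)

# Hypothetical data frame with one problematic character column
df <- data.frame(AuthorName = c("S\u00e9g.", "Ruiz & Pav."),
                 stringsAsFactors = FALSE)

# Convert every character column to ASCII; non-ASCII characters
# are replaced by a substitution byte (\032)
df[] <- lapply(df, function(col) {
  if (is.character(col)) stri_enc_toascii(col) else col
})

stri_enc_mark(df$AuthorName)
#> [1] "ASCII" "ASCII"
```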

  2. If the encoding of the package is declared as UTF-8, why are UTF-8 strings still flagged by the check?

As I have already commented here, you will get the note if you use R CMD build + R CMD check (I got the note with this combo, but I do not get it with devtools::check()). It looks like devtools::check() does not check datasets by default (i.e. tools:::.check_package_datasets(".") is not run).
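For clarity, a “marked” UTF-8 string is simply one whose declared encoding is "UTF-8", as reported by Encoding(); that is what the dataset check counts. A minimal base-R illustration (using made-up strings, not the actual dataset):

```r
# ASCII-only strings carry no encoding mark ("unknown"); strings
# containing non-ASCII characters are marked "UTF-8"
x <- c("Ruiz & Pav.", "S\u00e9g.")
Encoding(x)
#> [1] "unknown" "UTF-8"
```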

Hope this is useful.


Thank you for your comments, @KevCaz

I understand why the different checks produce different outputs, but my question is becoming more of a philosophical one (see below).

First, some experiments. As you said, the problem is the example data set, and more specifically the author names, which contain special characters in UTF-8:

library(stringi)
library(taxlist)

Names <- Easplist@taxonNames$AuthorName[c(5299, 5021, 5019)]
Names
#> [1] "Borsch, Kai Müll. & Eb.Fisch." "Ruiz & Pav."                  
#> [3] "Ség."

stri_enc_mark(Names)
#> [1] "UTF-8" "ASCII" "UTF-8"

iconv(Names, "utf8", "ascii")
#> [1] NA            "Ruiz & Pav." NA           

stri_enc_toascii(Names)
#> [1] "Borsch, Kai M\032ll. & Eb.Fisch." "Ruiz & Pav."                     
#> [3] "S\032g."                         

stri_trans_general(Names, "latin-ascii")
#> [1] "Borsch, Kai Mull. & Eb.Fisch." "Ruiz & Pav."                  
#> [3] "Seg."                         

In my opinion, the last option would be the most convenient way to transform the problematic vectors. From a taxonomic point of view, however, these characters are part of the identity of a taxon usage name, so the transformation modifies the contained information (perhaps comparable to rounding decimal numbers for numerical convenience).

If everyone sees what I see on my screen regardless of their locale settings (the code block in this message includes the console output), I don’t see why I should modify this information.

One additional question: did I get it right that the encoding declaration in the package metadata concerns only the documentation, not the distributed data?


I think you are making good points, and you should write them down (in the cran-comments file) before submitting your package to CRAN, as suggested in this thread.

I guess the philosophical problem remains: for users whose setup lacks a font that renders those non-ASCII characters properly, this “part of the identity of the taxon” will not be preserved anyway. The only alternative would be to use \uxxxx escapes (see Writing R Extensions), which would suppress the note but does not resolve the philosophical issue (well, it actually does for people who know the Unicode table by heart :sweat_smile:).
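If you go the escape route, you don’t have to build the \uxxxx sequences by hand — as a sketch (reusing one of the strings from the example above), stringi can generate them for you:

```r
library(stringi)

# Produce portable \uxxxx escapes for non-ASCII characters;
# ASCII characters pass through unchanged
stri_escape_unicode("S\u00e9g.")
#> [1] "S\\u00e9g."
```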

As for your question, I think you are right, but it is not 100% clear to me whether the encoding mentioned in the DESCRIPTION file must also apply to the datasets.

A last comment: you can quickly identify non-ASCII characters in source files with tools::showNonASCIIfile(); for a binary data/Easplist.rda you can instead load the data and pass its character vectors to tools::showNonASCII().
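As a small sketch (using made-up strings rather than the actual dataset), tools::showNonASCII() prints only the elements that contain non-ASCII bytes:

```r
# Only the second element contains a non-ASCII character,
# so only that one is printed (with the offending byte escaped)
tools::showNonASCII(c("Ruiz & Pav.", "S\u00e9g."))
```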


Thank you again for the moral support. I should note that I’m a self-taught programmer and a lot of this has been done by trial and error.

I’ll do as recommended and add a comment at submission. In the last release on CRAN the encoding issue was considered solved (before that, values with UTF-8 symbols were marked as ‘unknown’). In the worst case I will apply one of the transformations, hoping that no one runs into trouble with their own encoding settings…


From my little experience with CRAN, the team is very helpful, especially when you explain the reasons behind a given choice. You have clearly been working very carefully on your package; if they are not happy with the note, I am convinced they will guide you, and I would be very happy to hear about the solution they recommend (if any)!