While checking the package taxlist with goodpractice::gp() and rhub::check(platform = "debian-clang-devel"), I’m getting a NOTE:
checking data for non-ASCII characters … NOTE
Note: found 137 marked UTF-8 strings
This is clearly due to special characters in the data set Easplist (Easplist@taxonNames$AuthorName). Although I don’t get such a note using devtools::check_built() with my local settings, and I assume this is not a major issue on other platforms, I’m still struggling to reach a final opinion on two points:
Is there a way to get rid of this note or should I just ignore it?
If the encoding of the package is declared as UTF-8, why are UTF-8 strings still flagged by the check?
As I have already commented here, you will get the note if you use R CMD build + R CMD check (I got the note with this combo, but I do not get it with devtools::check()). It looks like devtools::check() does not check datasets by default, i.e. tools:::.check_package_datasets(".") is not run.
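In case it helps to reproduce the difference, roughly the following (the tarball name below is a guess, and .check_package_datasets() is an internal function, so its interface may change):

```r
## In a shell: the combination that produces the NOTE
# R CMD build taxlist
# R CMD check taxlist_*.tar.gz

## From an R session started in the package source directory, the dataset
## check can also be triggered directly via the internal helper:
tools:::.check_package_datasets(".")
#> Note: found 137 marked UTF-8 strings   # the NOTE reported above
```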
I understood the point about different checks giving different output, but my question is becoming more of a philosophical one (see below).
First, some experiments. As you said, the problem is the example data set and, more specifically, the author names, which contain special characters in UTF-8:
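Something along these lines (a minimal sketch; the exact calls and the console output may differ from what I originally ran, and the stringi/iconv calls are only one possible way to do the comparison):

```r
library(taxlist)
data(Easplist)

## The problematic vector: author names marked as UTF-8
authors <- Easplist@taxonNames$AuthorName
table(Encoding(authors))                      # counts of "unknown" vs "UTF-8"
head(authors[Encoding(authors) == "UTF-8"])   # a few of the flagged strings

## Candidate transformations for the problematic vectors:
## 1. escape the special characters (information kept, but hard to read)
stringi::stri_escape_unicode(head(authors[Encoding(authors) == "UTF-8"]))
## 2. transliterate to plain ASCII (diacritics are dropped)
iconv(authors, from = "UTF-8", to = "ASCII//TRANSLIT")
```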
In my opinion, the last of these would be the most convenient way to transform the problematic vectors, but from the taxonomic point of view those characters are part of the identity of a taxon usage name, so changing them modifies the contained information (perhaps comparable to rounding decimal numbers for numerical convenience).
If everyone sees what I see on my screen regardless of their local settings (the code block in this message includes console output), I don’t see why I should modify this information.
One additional question: did I get it right that the encoding declared in the package metadata only concerns the documentation, not the distributed data?
I think you are making good points, and you should write them down (in the cran-comments file) before submitting your package to CRAN, as suggested in this thread.
I guess the philosophical problem remains that, for people whose setup doesn’t include a font that renders those non-ASCII characters properly, this “part of the identity of the taxon” will not be preserved. The only alternative would be to use \uxxxx escapes (see Writing R Extensions), which would suppress the note but does not resolve the philosophical issue (well, it actually does for people who know the Unicode table by heart).
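Just to illustrate what those escapes look like (the author name here is an arbitrary example, and stringi::stri_escape_unicode() is only one way to generate them):

```r
## Turning a non-ASCII author name into \uxxxx escapes for use in R sources,
## as described in Writing R Extensions (example string chosen arbitrarily):
x <- "Müll. Arg."
stringi::stri_escape_unicode(x)
#> [1] "M\\u00fcll. Arg."

## The escaped form parses back to the original string:
"M\u00fcll. Arg."
#> [1] "Müll. Arg."
```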
As for your question, I think you are right, but it is not 100% clear to me whether the encoding declared in the DESCRIPTION file must also apply to the datasets.
One last comment: you can quickly identify non-ASCII characters with tools::showNonASCIIfile("data/Easplist.rda").
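For a vector that is already loaded in the session, tools::showNonASCII() is a closely related helper: it prints only the elements containing non-ASCII characters. For example:

```r
library(taxlist)
data(Easplist)

## Print only the author names that contain non-ASCII characters
tools::showNonASCII(Easplist@taxonNames$AuthorName)
```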
Thank you again for the moral support. I should note that I’m a self-taught programmer and a lot of this has been done by trial and error.
I’ll do as recommended and add a comment at submission. In the last release on CRAN the encoding issue was considered solved (before that, values with UTF-8 symbols were marked as ‘unknown’). In the worst case I will apply one of the transformations, hoping that no one runs into trouble with their own encoding settings…
From my limited experience with CRAN, the team is very helpful, especially when you explain the reasons behind a given choice. You have clearly been working very carefully on your package; if they are not happy with the note, I am convinced they will guide you, and I would be very happy to hear about the solution they recommend (if any)!