NOTE on UTF-8 strings by `goodpractice::gp()`

While checking the package taxlist using goodpractice::gp() and rhub::check(platform = "debian-clang-devel") I’m getting a NOTE:

  • checking data for non-ASCII characters … NOTE
    Note: found 137 marked UTF-8 strings

This is clearly due to special characters in the data set Easplist (Easplist@taxonNames$AuthorName). Although I don’t get this note with devtools::check_built() under my local settings, and I assume it is not a major issue on other platforms, I’m still struggling to reach a final opinion on two points:

  1. Is there a way to get rid of this note or should I just ignore it?
  2. If the encoding of the package is declared as UTF-8, why are UTF-8 strings still flagged by the check?

This issue is under discussion here and here.

  1. Is there a way to get rid of this note or should I just ignore it?

You can always edit your data frame. IIRC, I’ve done that once using stri_enc_toascii() from the stringi package.
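As a minimal sketch (with a hypothetical data frame standing in for the package dataset), the conversion could look like this — note that stri_enc_toascii() substitutes non-ASCII characters rather than transliterating them, so information is lost:

```r
library(stringi)

# Hypothetical data frame with one problematic character column
df <- data.frame(AuthorName = c("S\u00e9g.", "Ruiz & Pav."),
                 stringsAsFactors = FALSE)

# Convert every character column to ASCII; non-ASCII characters
# are replaced by a substitution byte (\032)
df[] <- lapply(df, function(col) {
  if (is.character(col)) stri_enc_toascii(col) else col
})

stri_enc_mark(df$AuthorName)
#> [1] "ASCII" "ASCII"
```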

  2. If the encoding of the package is declared as UTF-8, why are UTF-8 strings still flagged by the check?

As I have already commented here, you will get the note if you use R CMD build + R CMD check (I got the note with this combo, but I do not get it with devtools::check()). It looks like devtools::check() does not check datasets by default (i.e. tools:::.check_package_datasets(".") is not run).
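For clarity, a “marked” UTF-8 string is simply one whose declared encoding is "UTF-8", as reported by Encoding(); that is what the dataset check counts. A minimal base-R illustration (using made-up strings, not the actual dataset):

```r
# ASCII-only strings carry no encoding mark ("unknown"); strings
# containing non-ASCII characters are marked "UTF-8"
x <- c("Ruiz & Pav.", "S\u00e9g.")
Encoding(x)
#> [1] "unknown" "UTF-8"
```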

Hope this is useful.


Thank you for your comments, @KevCaz

I understand why the different checks produce different outputs, but my question is becoming more of a philosophical one (see below).

First, some experiments. As you said, the problem is the example data set, and more specifically the author names, which contain special characters in UTF-8:

library(stringi)
library(taxlist)

Names <- Easplist@taxonNames$AuthorName[c(5299, 5021, 5019)]
Names
#> [1] "Borsch, Kai Müll. & Eb.Fisch." "Ruiz & Pav."                  
#> [3] "Ség."

stri_enc_mark(Names)
#> [1] "UTF-8" "ASCII" "UTF-8"

iconv(Names, "utf8", "ascii")
#> [1] NA            "Ruiz & Pav." NA           

stri_enc_toascii(Names)
#> [1] "Borsch, Kai M\032ll. & Eb.Fisch." "Ruiz & Pav."                     
#> [3] "S\032g."                         

stri_trans_general(Names, "latin-ascii")
#> [1] "Borsch, Kai Mull. & Eb.Fisch." "Ruiz & Pav."                  
#> [3] "Seg."                         

In my opinion, the last option would be the most convenient way to transform the problematic vectors. From a taxonomic point of view, however, these characters are part of the identity of a taxon usage name, so the transformation modifies the contained information (perhaps comparable to rounding decimal numbers for numerical convenience).

If everyone sees what I see on my screen regardless of their locale settings (the code block in this message includes the console output), I don’t see why I should modify this information.

One additional question: did I get it right that the encoding declaration in the package metadata concerns only the documentation, not the distributed data?


I think you are making good points, and you should write them down (in the cran-comments file) before submitting your package to CRAN, as suggested in this thread.

I guess the philosophical problem remains: for users whose setup lacks a font that renders those non-ASCII characters properly, this “part of the identity of the taxon” will not be preserved anyway. The only alternative would be to use \uxxxx escapes (see Writing R Extensions), which would suppress the note but does not resolve the philosophical issue (well, it actually does for people who know the Unicode table by heart :sweat_smile:).
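If you go the escape route, you don’t have to build the \uxxxx sequences by hand — as a sketch (reusing one of the strings from the example above), stringi can generate them for you:

```r
library(stringi)

# Produce portable \uxxxx escapes for non-ASCII characters;
# ASCII characters pass through unchanged
stri_escape_unicode("S\u00e9g.")
#> [1] "S\\u00e9g."
```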

As for your question, I think you are right, but it is not 100% clear to me whether the encoding mentioned in the DESCRIPTION file must also apply to the datasets.

A last comment: you can quickly identify non-ASCII characters in source files with tools::showNonASCIIfile(); for a binary data/Easplist.rda you can instead load the data and pass its character vectors to tools::showNonASCII().
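As a small sketch (using made-up strings rather than the actual dataset), tools::showNonASCII() prints only the elements that contain non-ASCII bytes:

```r
# Only the second element contains a non-ASCII character,
# so only that one is printed (with the offending byte escaped)
tools::showNonASCII(c("Ruiz & Pav.", "S\u00e9g."))
```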


Thank you again for the moral support. I should note that I’m a self-taught programmer and a lot of this has been done by trial and error.

I’ll do as recommended and add a comment at submission. In the last release on CRAN the encoding issue was considered solved (before that, values with UTF-8 symbols were marked as ‘unknown’). In the worst case I will apply one of the transformations, hoping that no one runs into trouble with their own encoding settings…


From my little experience with CRAN, the team is very helpful, especially when you explain the reasons behind a given choice. You have clearly been working very carefully on your package; if they are not happy with the note, I am convinced they will guide you, and I would be very happy to hear about the solution they recommend (if any)!