pdftools converting hieroglyph

#1

Hello. Previously, pdftools encoded the hieroglyphs in this format <U+52D9>, and now like this 零. How can I go back to the first format?

#2

Thanks for your question, @Alex.

Any ideas on this, @jeroenooms?

#3

Can you please include example code and an example PDF, and specify when "previously" was? Which version numbers of pdftools/R/Windows?

#4

The code is simple:
txt <- pdf_text(path)

Previous configuration:
Ubuntu 16.04
R version 3.4.4 (2018-03-15)
pdftools 1.8

Now:
Ubuntu 18.04.2 LTS
R version 3.4.4 (2018-03-15)
pdftools 2.1
I installed the previous versions of the libraries, but it did not help.
Link to all the files: https://yadi.sk/d/gVPpSmpMzDyl2Q
I could not find how to attach them here.

#5

I think the difference is in your locale, not the version of pdftools. R automatically escapes non-ASCII strings when printing them in the C locale. First print a page in your current locale:

 txt <- pdf_text(path)
 print(txt[2])

And now try this:

Sys.setlocale(locale = "C")
print(txt[2])

However, I would not recommend this, because it changes the locale for the whole R session. If you really want escape sequences, you could use stringi:

stringi::stri_escape_unicode(txt[2])

That should properly escape UTF-8 characters in any locale.
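
To make the escaping approach concrete, here is a minimal sketch. The `page` string is a made-up stand-in for one element of the `pdf_text()` result, so it runs without a PDF. The `iconv()` call with `sub = "Unicode"` is not from this thread; it is a base R option that, as far as I know, substitutes the `<U+xxxx>` notation from the original question, but it may need a newer R than the 3.4.4 mentioned above, so check `?iconv` first.

```r
library(stringi)

# Hypothetical stand-in for one page of pdf_text() output.
page <- "Tokyo \u52d9 \u96f6"

# stringi: replace every non-ASCII character with an escape sequence,
# independent of the current locale.
escaped <- stri_escape_unicode(page)

# The escaping is reversible, so the original text can be recovered.
restored <- stri_unescape_unicode(escaped)

# Base R alternative (assumption: your R version supports sub = "Unicode"):
# characters that cannot be represented in ASCII are written as <U+xxxx>.
legacy <- iconv(page, from = "UTF-8", to = "ASCII", sub = "Unicode")

print(escaped)
print(restored == page)
print(legacy)
```

Neither call touches the session locale, so the rest of your code keeps printing and sorting text as before.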

#6

Thank you. Sys.setlocale(locale = "C") helped me.
