Hi, I'm hoping someone can help with some advice. I am looking to OCR a scanned PDF document using tesseract, where the pages could be in any orientation. When I run tesseract with OSD (orientation and script detection), I get an error saying the language is not available.
Full sample code:
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tesseract, pdftools, magick)
# Convert a simple scanned PDF to confirm it's working
pdf_convert("https://idrh.ku.edu/sites/idrh.ku.edu/files/files/tutorials/pdf/Non-text-searchable.pdf", format = "png", pages = NULL, filenames = NULL, dpi = 300)
ocr1 <- ocr("Non-text-searchable_1.png")
cat(ocr1)
# Use ImageMagick to rotate the image file, as a sample of what I will typically get
img1 <- image_read("Non-text-searchable_1.png")
img1r <- image_rotate(img1, 90)
image_write(img1r, path = "Non-text-searchable_1r.png", format = "png")
# Install the osd language and set up an engine with the OSD option turned on
tesseract_download(lang = "osd", datapath = NULL, progress = TRUE)
osdeng <- tesseract(language = "eng", datapath = NULL, configs = NULL, cache = TRUE,
                    options = list(tessedit_pageseg_mode = "0"))
# Try to OCR the rotated image with OSD
ocr2 <- ocr("Non-text-searchable_1r.png", engine = osdeng)
cat(ocr2)
When I run the final OCR step (ocr2 <- …), I get the error:
Failed loading language 'osd'
Tesseract couldn't load any languages!
Warning: Auto orientation and script detection requested, but osd language failed to load
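In case it helps with diagnosis, here is a small check of which languages the tesseract package can see and where it looks for them. This is only a sketch: I am assuming the datapath reported by tesseract_info() is the same one my engine uses, since I left datapath = NULL everywhere.

```r
library(tesseract)

# Ask the package where it looks for training data and which languages it finds there
info <- tesseract_info()
print(info$datapath)   # directory searched for *.traineddata files
print(info$available)  # languages found in that directory

# TRUE if the osd training data is visible to the package
print("osd" %in% info$available)
```

If "osd" shows up here but the engine still fails to load it, that would suggest the OSD step is looking in a different location from the one tesseract_download() wrote to.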
Any suggestions on what I may be doing wrong, or on alternative approaches, would be much appreciated.