Using Tesseract with Page Segmentation Mode 0 for Orientation and Script Detection (OSD)

tesseract

#1

Hi, I’m hoping someone can offer some advice. I want to OCR a scanned PDF document using tesseract, where the pages could be in any orientation. When I run tesseract with OSD, I get an error saying the language is not available.

Full sample code:

if (!require("pacman")) install.packages("pacman")
pacman::p_load(tesseract, pdftools, magick)

# Convert a simple scanned PDF to confirm it's working

pdf_convert("https://idrh.ku.edu/sites/idrh.ku.edu/files/files/tutorials/pdf/Non-text-searchable.pdf", format = "png", pages = NULL, filenames = NULL, dpi = 300)
ocr1 <- ocr("Non-text-searchable_1.png")
cat(ocr1)

# Use ImageMagick to rotate the image file, as a sample of what I will typically get

img1 <- image_read("Non-text-searchable_1.png")
img1r <- image_rotate(img1, 90)
image_write(img1r, path = "Non-text-searchable_1r.png", format = "png")

# Install the osd language and set up an engine with the OSD option turned on

tesseract_download(lang = "osd", datapath = NULL, progress = TRUE)
osdeng <- tesseract(language = "eng", datapath = NULL, configs = NULL, cache = TRUE,
                    options = list(tessedit_pageseg_mode = "0"))

# Try to OCR the rotated image with OSD

ocr2 <- ocr("Non-text-searchable_1r.png", engine = osdeng)
cat(ocr2)

When I run the final ocr step (ocr2 <- …) I get the error:

Failed loading language 'osd'
Tesseract couldn't load any languages!
Warning: Auto orientation and script detection requested, but osd language failed to load

Any suggestions on what I may be doing wrong or alternatives would be much appreciated.
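
As one possible alternative that sidesteps the osd training data entirely, the sketch below OCRs the page at 0, 90, 180 and 270 degrees and keeps the rotation with the highest mean word confidence from ocr_data(). This is a heuristic, not the package's own orientation detection. The helper names best_angle, score_rotation and detect_rotation are made up for illustration, and mean confidence is only assumed to be a usable proxy for correct orientation.

```r
# Hypothetical helper: return the angle whose score is highest.
best_angle <- function(angles, scores) {
  angles[which.max(scores)]
}

# Hypothetical helper: mean word confidence after rotating `img` by `angle`.
score_rotation <- function(img, angle, engine) {
  tmp <- tempfile(fileext = ".png")
  on.exit(unlink(tmp))
  magick::image_write(magick::image_rotate(img, angle), path = tmp, format = "png")
  words <- tesseract::ocr_data(tmp, engine = engine)
  if (nrow(words) == 0) return(0)  # nothing recognised at this angle
  mean(words$confidence)
}

# Hypothetical helper: try the four right-angle rotations of one image file.
detect_rotation <- function(path, angles = c(0, 90, 180, 270)) {
  engine <- tesseract::tesseract("eng")
  img <- magick::image_read(path)
  scores <- vapply(angles, function(a) score_rotation(img, a, engine), numeric(1))
  best_angle(angles, scores)
}

# Only runs when both packages and the rotated sample file are present.
sample_file <- "Non-text-searchable_1r.png"
if (requireNamespace("tesseract", quietly = TRUE) &&
    requireNamespace("magick", quietly = TRUE) &&
    file.exists(sample_file)) {
  print(detect_rotation(sample_file))
}
```

The returned angle is the rotation to apply with image_rotate() before the final ocr() call. The trade-off is four OCR passes per page instead of one.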


#2

Can you open an issue on GitHub? I am guessing you’re on Windows?
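
For the issue report, it may help to include what the package actually sees on that machine. A minimal sketch, assuming the list returned by tesseract_info() carries the training data directory in datapath and the installed languages in available; osd_report is a hypothetical helper name, not part of the package.

```r
# Hypothetical helper: collect the details an issue report would likely need.
osd_report <- function(info, sysname = Sys.info()[["sysname"]]) {
  list(
    os       = sysname,                    # e.g. "Windows", "Linux", "Darwin"
    datapath = info$datapath,              # where the engine looks for traineddata
    has_osd  = "osd" %in% info$available,  # was osd.traineddata found there?
    has_eng  = "eng" %in% info$available
  )
}

# Only runs when the tesseract package is installed.
if (requireNamespace("tesseract", quietly = TRUE)) {
  str(osd_report(tesseract::tesseract_info()))
}
```

If has_osd is FALSE after tesseract_download(lang = "osd") has run, the download probably went to a different directory than the one the engine reads from, which would be worth stating in the issue.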