Using Tesseract with Page Segmentation Mode 0 for Orientation and script detection (OSD)

griga · March 13, 2018, 9:14pm

Hi, I’m hoping someone can help with some advice. I want to OCR a scanned PDF document using tesseract, where the pages could be in any orientation. When I run tesseract with OSD, I get an error saying the language is not available.

Full sample code:

if (!require("pacman")) install.packages("pacman")
pacman::p_load(tesseract, pdftools, magick)

# convert a simple scanned pdf to confirm it's working

pdf_convert("https://idrh.ku.edu/sites/idrh.ku.edu/files/files/tutorials/pdf/Non-text-searchable.pdf", format = "png", pages = NULL, filenames = NULL, dpi = 300)
ocr1 <- ocr("Non-text-searchable_1.png")
cat(ocr1)

# use image magick to rotate the image file as a sample of what I will typically get

img1 <- image_read("Non-text-searchable_1.png")
img1r <- image_rotate(img1, 90)
image_write(img1r, path = "Non-text-searchable_1r.png", format = "png")

# install the osd language and set up an engine with the osd option turned on

tesseract_download(lang = "osd", datapath = NULL, progress = TRUE)
osdeng <- tesseract(language = "eng", datapath = NULL, configs = NULL, cache = TRUE,
                    options = list(tessedit_pageseg_mode = "0"))

# try to ocr the rotated image with osd

ocr2 <- ocr("Non-text-searchable_1r.png", engine = osdeng)
cat(ocr2)

When I run the final ocr step (ocr2 <- …) I get the error:

Failed loading language 'osd'
Tesseract couldn't load any languages!
Warning: Auto orientation and script detection requested, but osd language failed to load

Any suggestions on what I may be doing wrong or alternatives would be much appreciated.

jeroenooms · March 15, 2018, 10:47pm

Can you open an issue on GitHub? I am guessing you’re on Windows?
