Tesseract 4 is here! State of the art OCR in R!

jeroenooms · November 5, 2018, 12:01pm

Last week Google and friends released the new major version of their OCR system: Tesseract 4. This release builds upon 2+ years of hard work and has completely overhauled the internal OCR engine. From the tesseract wiki:

Tesseract 4.0 includes a new neural network-based recognition engine that delivers significantly higher accuracy (on document images) than the previous versions, in return for a significant increase in required compute power. On complex languages however, it may actually be faster than base Tesseract.

We have now also updated the R package tesseract to ship with the new Tesseract 4 on macOS and Windows. It uses the new engine by default, and the results are extremely impressive! Recognition is much more accurate than before, even without manually enhancing the image quality.

Read more: rOpenSci | Tesseract 4 is here! State of the art OCR in R!
