Last week Google and friends released the new major version of their OCR system: Tesseract 4. This release builds upon 2+ years of hard work and has completely overhauled the internal OCR engine. From the tesseract wiki:
Tesseract 4.0 includes a new neural network-based recognition engine that delivers significantly higher accuracy (on document images) than the previous versions, in return for a significant increase in required compute power. On complex languages however, it may actually be faster than base Tesseract.
We have now also updated the R package tesseract to ship with the new Tesseract 4 on MacOS and Windows. It uses the new engine by default, and the results are extremely impressive! Recognition is much more accurate then before, even without manually enhancing the image quality.
Read more: rOpenSci | Tesseract 4 is here! State of the art OCR in R!