pdftools + tesseract to extract text in Spanish

silviaegt · July 15, 2021, 2:03am

I converted an image-only PDF (scanned text) into machine-readable text using Tesseract's OCR and the pdf_ocr_text() function from pdftools.
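For readers who want to try the same thing, here is a minimal sketch of how such a conversion might look. It is not the exact code used for this post: the helper name `ocr_spanish_pdf`, the file paths, and the `dpi` default are assumptions made for illustration. It relies on `pdftools::pdf_ocr_text()` with `language = "spa"`, and on `tesseract::tesseract_download()` to fetch the Spanish training data when it is missing.

```r
# Hypothetical helper (name and defaults are illustrative, not from the post).
# Requires the pdftools and tesseract packages to be installed.
ocr_spanish_pdf <- function(pdf_path, dpi = 600) {
  # Check whether Spanish training data ("spa") is already available to Tesseract.
  if (!"spa" %in% tesseract::tesseract_info()$available) {
    # Fetch the Spanish training data. On Linux, installing the distribution's
    # Spanish language package for Tesseract may be the preferred route.
    tesseract::tesseract_download("spa")
  }
  # Render each page to an image and run OCR on it.
  # Returns a character vector with one element per page.
  pdftools::pdf_ocr_text(pdf_path, language = "spa", dpi = dpi)
}

# Usage, assuming a scanned file exists at this hypothetical path:
# pages <- ocr_spanish_pdf("documento_escaneado.pdf")
# writeLines(pages, "documento.txt")
```

A higher `dpi` tends to help with small or faint print at the cost of speed, so it may be worth experimenting with that value on a few pages before processing a whole document.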

academic / non-profit

Humanities, and any other discipline that uses PDFs!

I love what you do: thank you @rOpenSci-Staff! It would be amazing to be able to train models to improve the OCR.
