Text vs Word xy Coordinate Differences Between pdf_data() and pdf_ocr_data()

lfish · June 13, 2023, 2:16am

The pdftools package has two related functions, pdf_data() and pdf_ocr_data().
pdf_data() returns a list of tibbles, each with 6 fields: width, height, x, y, space, and text.
pdf_ocr_data() returns a list of tibbles, each with 3 fields: word, confidence, and bbox.

There is very little documentation on the pdf_ocr_data() function, and I’m having trouble figuring out what the elements of bbox are. Does anyone know what they represent? They seem to be related to the word coordinates, but the same word on the same page returns different values when imported with pdf_ocr_data() instead of pdf_data().
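
To make the comparison concrete, here is a minimal sketch of how a bbox value might be put on the same scale as the pdf_data() fields. It rests on assumptions that are not confirmed in this post: that bbox is a comma-separated string of the form "x1,y1,x2,y2" measured in pixels of the page image rendered for OCR, that the page was rendered at 600 dpi, and that pdf_data() reports x, y, width, and height in PDF points (72 per inch). The bbox strings below are made-up illustration values, not output from a real PDF.

```r
# Hypothetical bbox strings for illustration only; not taken from a real PDF.
# Assumed format: "x1,y1,x2,y2" in pixels of the rendered page image.
bbox <- c("300,600,550,680", "575,600,900,680")

# Assumed rendering resolution used when the page was converted for OCR.
dpi <- 600

# Split each bbox string into four numeric columns.
parts <- do.call(rbind, strsplit(bbox, ",", fixed = TRUE))
storage.mode(parts) <- "numeric"
colnames(parts) <- c("x1", "y1", "x2", "y2")

# Convert pixels to points (72 per inch), the unit pdf_data() is assumed to use.
scale <- 72 / dpi

# Rebuild pdf_data()-style columns: top-left corner plus width and height.
ocr_pts <- data.frame(
  x      = parts[, "x1"] * scale,
  y      = parts[, "y1"] * scale,
  width  = (parts[, "x2"] - parts[, "x1"]) * scale,
  height = (parts[, "y2"] - parts[, "y1"]) * scale
)

print(ocr_pts)
```

If those assumptions hold, the converted values should land close to the pdf_data() coordinates for the same word, though not exactly on them, since OCR draws its boxes around the rendered glyphs rather than reading them from the PDF text layer.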
