In the pdftools package, there are two functions: pdf_data() and pdf_ocr_data().
pdf_data() results in a list of tibbles, each with 6 fields: width, height, x, y, space, and text.
pdf_ocr_data() results in a list of tibbles with 3 fields: word, confidence, and bbox.
There is very little documentation on the pdf_ocr_data() function, and I’m having trouble figuring out what the elements of bbox are. Does anyone know what they represent? They seem to have something to do with the word coordinates, but the same word from the same page gets different values when read with pdf_ocr_data() instead of pdf_data().
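For reference, here is a rough sketch of the comparison I have in mind. It rests on two assumptions that I have not been able to confirm from the documentation: that each bbox value is a comma-separated string of the form "x1,y1,x2,y2" giving pixel coordinates on the rendered page image, and that the page was rendered at 600 dpi, whereas pdf_data() reports positions in PDF points (72 per inch). The helper name parse_bbox() is hypothetical and not part of pdftools.

```r
# Hypothetical helper (not part of pdftools).
# ASSUMPTION: bbox is a string "x1,y1,x2,y2" in pixels of the rendered image.
# ASSUMPTION: the image was rendered at `dpi` dots per inch (600 here).
parse_bbox <- function(bbox, dpi = 600) {
  # Split each string on commas and stack the pieces into an n x 4 matrix
  parts <- do.call(rbind, strsplit(bbox, ",", fixed = TRUE))
  coords <- matrix(as.numeric(parts), ncol = 4)
  colnames(coords) <- c("x1", "y1", "x2", "y2")

  # Rescale from pixels to PDF points (72 points per inch)
  pts <- coords * 72 / dpi

  # Express the box the way pdf_data() does: top-left corner plus size
  data.frame(
    x      = pts[, "x1"],
    y      = pts[, "y1"],
    width  = pts[, "x2"] - pts[, "x1"],
    height = pts[, "y2"] - pts[, "y1"]
  )
}

# Made-up bbox strings, only to show the shape of the result
parse_bbox(c("600,1200,900,1300", "0,0,300,150"))
```

If those assumptions hold, the rescaled x, y, width, and height should land close to the pdf_data() values for the same word, though probably not exactly, since OCR draws its boxes around the rendered glyphs.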