detect bounding box text or non-text using tesseract

MAAbdullah47 · April 27, 2022, 3:19pm

Hi, I have two questions.
Q1. I need code for the following. This is what I have so far:

eng <- tesseract("eng")
ara <- tesseract("ara")
whitelist <- "1234567890-.,;:أةؤب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ا @ß€!$%&/()=?+"
text1 <- ocr("E:/OCR Test/Test Bill.jpeg",
             engine = tesseract(language = "ara",
                                options = list(tessedit_char_whitelist = whitelist)))

Please refer to the attached image below.
How do I search the text extracted into text1 for a given string, e.g. "VAT. 30047", and get the X1, Y1, X2, Y2 coordinates of the word I searched for ("VAT.")? Then, how do I get the coordinates of the number that follows "VAT.", combine the two into one or two strings, and, once the position of the text is known, copy the result into another storage structure, e.g. a data frame?
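One possible way to approach Q1, as a rough sketch rather than a tested solution: ocr() returns plain text only, while ocr_data() from the same package returns one row per word with word, confidence and bbox columns, where bbox is an "x1,y1,x2,y2" string. The helpers split_bbox() and find_keyword_and_number() below are hypothetical names made up for this example. The sketch assumes the number is the first token containing a digit after the keyword. It also uses an engine without the whitelist, because the whitelist above contains no Latin letters and so "VAT" could not be returned with it.

```r
# Split the "x1,y1,x2,y2" strings from ocr_data() into numeric columns.
split_bbox <- function(words) {
  if (nrow(words) == 0) {
    return(data.frame(word = character(0), x1 = numeric(0), y1 = numeric(0),
                      x2 = numeric(0), y2 = numeric(0)))
  }
  coords <- do.call(rbind, strsplit(as.character(words$bbox), ",", fixed = TRUE))
  storage.mode(coords) <- "numeric"
  colnames(coords) <- c("x1", "y1", "x2", "y2")
  data.frame(word = as.character(words$word), coords, stringsAsFactors = FALSE)
}

# Return the first word containing `keyword`, plus the first later word
# that contains a digit (assumed here to be the number belonging to it).
find_keyword_and_number <- function(words, keyword) {
  boxes <- split_bbox(words)
  hit <- which(grepl(keyword, boxes$word, fixed = TRUE))[1]
  if (is.na(hit)) return(boxes[0, ])
  is_number_after <- grepl("[0-9]", boxes$word) & seq_len(nrow(boxes)) > hit
  value <- which(is_number_after)[1]
  rows <- c(hit, value)
  boxes[rows[!is.na(rows)], ]
}

# Made-up rows in the shape ocr_data() returns, so the sketch runs without an image.
demo_words <- data.frame(
  word = c("Total", "VAT", ".", "30047", "SAR"),
  confidence = c(95, 93, 80, 91, 90),
  bbox = c("10,10,60,30", "70,10,110,30", "112,10,116,30",
           "120,10,180,30", "190,10,230,30"),
  stringsAsFactors = FALSE
)
found <- find_keyword_and_number(demo_words, "VAT")
combined <- paste(found$word, collapse = " ")
print(found)
print(combined)

# On the real image (only runs if the package and the file are available).
path <- "E:/OCR Test/Test Bill.jpeg"
if (requireNamespace("tesseract", quietly = TRUE) && file.exists(path)) {
  words <- tesseract::ocr_data(path, engine = tesseract::tesseract("eng"))
  print(find_keyword_and_number(words, "VAT"))
}
```

The result is already a data frame with one row for the keyword and one for the number, so it can be stored or written out directly. A single box around both words would be min(found$x1), min(found$y1), max(found$x2), max(found$y2).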

Q2.
If the image contains two languages, English and Arabic, how do I set both languages for text1 above?
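For Q2, a minimal sketch, assuming the usual Tesseract convention of joining language codes with "+" (for example "eng+ara") and assuming the trained data for both languages is already installed. language_spec() is a hypothetical helper used only for this example. tesseract_info() is used to check which languages are available before building the engine.

```r
# Join language codes in the form Tesseract expects, e.g. "eng+ara".
language_spec <- function(langs) paste(unique(langs), collapse = "+")

langs <- c("eng", "ara")
spec <- language_spec(langs)
print(spec)

# Only runs if the package and the image file are available.
path <- "E:/OCR Test/Test Bill.jpeg"
if (requireNamespace("tesseract", quietly = TRUE) && file.exists(path)) {
  not_installed <- setdiff(langs, tesseract::tesseract_info()$available)
  if (length(not_installed) > 0) {
    stop("Trained data not installed for: ", paste(not_installed, collapse = ", "),
         ". It can usually be fetched with tesseract_download().")
  }
  engine <- tesseract::tesseract(language = spec)
  text1 <- tesseract::ocr(path, engine = engine)
  cat(text1)
}
```

The same engine object can be passed to ocr_data() if word coordinates are needed for the mixed-language image.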
