Extracting Text from Invoices document using Bound box and paste in Excel

MAAbdullah47 · April 26, 2022, 10:33am

Hello All,
I am trying to read an .PNG, JPEG image and need to extract the text from that image. The extracted information is Invoices image I want to write those information in Excel sheet. I re-used the code published from the provider:
ExtractTable - API to convert image to excel, extract tables from PDF , here is the code:

## Load required R packages (must be installed first)
install.packages(c("magrittr", "jsonlite", "httr"))
require(magrittr)
require(jsonlite)
require(httr)


# Main Functions

## Parse Server Response
parseResponse <- function(server_resp) {return(fromJSON(content(server_resp, "text", encoding="UTF-8")))}


## Function to Check credits usage
check_credits <- function(api_key) {
  validate_endpoint = 'https://validator.extracttable.com'
  return(content(GET(url = validate_endpoint, add_headers(`x-api-key` = api_key)), as = 'parsed', type = 'application/json'))
}

## Function to Retrieve the result by JobId
retrieve_result <- function(api_key, job_id) {
  retrieve_endpoint = "https://getresult.extracttable.com"
  return(
    GET(
      url = paste0(retrieve_endpoint, "/?JobId=", job_id),
      add_headers(`x-api-key` = api_key)
    )
  )
}


## Function to trigger a file for extraction
proces_file <- function(api_key, filepath) {
  trigger_endpoint = "https://trigger.extracttable.com"
  return (
    POST(
      url = trigger_endpoint,
      add_headers(`Content-Type`="multipart/form-data", `x-api-key` = api_key),
      body = list(input = upload_file(filepath))
    )
  )
}


## Function to extract all tables from the input file
ExtractTable <- function(filepath, api_key) {
  server_response <- proces_file(api_key, filepath)
  parsed_resp = parseResponse(server_response)
  
  
  # Wait for a maximum of 5 minutes to finish the trigger job
  # Retries every 20 seconds
  max_wait_time = 5*60
  retry_interval = 20
  while (parsed_resp$JobStatus == 'Processing' & max_wait_time >= 0) {
    max_wait_time = max_wait_time - retry_interval
    print(paste0("Job is still in progress. Let's wait for ", retry_interval, " seconds"))
    Sys.sleep(retry_interval)
    server_response <- retrieve_result(api_key, job_id=parsed_resp$JobId)
    parsed_resp = parseResponse(server_response)
  }
  
  ### Parse the response for tables
  et_tables <- content(server_response, as = 'parsed', type = 'application/json')
  
  all_tables <- list()
  
  if (tolower(parsed_resp$JobStatus) != "success") {
    print(paste0("The processing was NOT SUCCESSFUL Below is the complete response from the server"))
    print(parsed_resp)
    return(all_tables)
  }
  
  ### Convert the extracted tabular JSON data as a dataframe for future use
  ### Each data frame represents one table
  for (i in 1:length(et_tables$Table)) {
    all_tables[[i]] <- sapply(et_tables$Tables[[i]]$TableJson, unlist) %>% t() %>% as.data.frame()
  }
  
  return(all_tables)
  
} #end of function



# Usage

## Intialize valid API key received from https://extracttable.com
api_key = ""

# Validate or check credits of the API key
credits <- check_credits(api_key = api_key)$usage


input_location = "E:/OCR Test/Test Bill.jpeg"
Excel_location = "E:/OCR Test/"

# Trigger the job for processing and get results as an array of dataframes
# Each data frame represents one table
results <- ExtractTable(api_key = api_key, filepath = input_location)
Size<-length(results)
i=1
for(i in 1:Size) {
  # No<-as.character(i)
  write.xlsx2(results[[i]], paste(Excel_location, "data_all.xlsx"), row.names = FALSE, sheetName = paste("Sheet", as.character(i), sep=""), append = TRUE)  # Append other data frames
}

I’m using also (tesseract) to do some processes including (arabic language) , I need to draw a Bound box on the Invoice to extract the Column (Information needed) and paste it in Excel sheet , can any one help me on how to draw Bound box on the Invoice image and paste it on excel sheet?

Note : Attached a sample for the Invoice where I want to paste the column bounded in red need to be paste in the excel sheet.

mpadge · April 26, 2022, 10:57am

The tabulizer package can help read tables from pdfs, and the magick pacakge can be used to reduce images to smaller regions (bounding boxes). See also this rOpenSci blog post about reading tables with pdftools which might help.

MAAbdullah47 · April 27, 2022, 9:04am

Thank you @mpadge

Used I think only for PDF but not for Image (Please confirm).
For magick package I didn’t find any tutorial for processing Text from Image all I find is image processing without text , if you have Tutorial link please send it to me I’ll be appreciated.
I need also to know how to deal with Images contains two languages e.g. (Arabic/English) , for example:
esults_OCR1 <- tesseract::ocr_data("E:/OCR Test/Test Bill.jpeg", engine = ara)

How I also include English for extraction in addition to Arabic?

where can I find good tutorial for learning R with (OpenCV) for extracting the text from Image OCR and parse the extracted text?

mpadge · April 27, 2022, 12:02pm

If you’re working purely with image files, then tesseract is your only real choice for text extraction via OCR. I’m not sure whether the R package permits multiple languages, but the engine itself certainly does. If the R package does not expose this ability, you’ll have to directly call the engine, for which see these docs. Note that those docs also clearly explain how you can convert the image to a pdf with OCR text that you could maybe then use as input to tabulizer.

MAAbdullah47 · April 27, 2022, 2:47pm

Thank you @mpadge for the last documentation Initially I think it worth reading , but I have question ,
I want to read Image from my machine by writing this syntax:

eng <- tesseract("eng")
ara <- tesseract("ara")
results_OCR2 <- tesseract::ocr_data("E:/OCR Test/Test 2.jpg", engine = ara)
Bound_OCR2<-as.data.frame(results_OCR2)

How I translate the R code above into the command line in the document link by using this syntax:

tesseract images/eurotext.png - -l eng+ara

steffilazerte · April 28, 2022, 5:28pm

Hi @MAAbdullah47 great question!

I’ve moved this topic to “Package Use Questions” so others have a better chance of finding it. If you figure out your problem using tesseract or tabulizer (or any other rOpenSci package), consider adding how you did so to “UseCases” as an example! Thanks

Topic		Replies	Views
detect bounding box text or non-text using tesseract Blog tesseract	0	482	April 27, 2022
pdftools + tesseract para extraer texto en español UseCases tesseract , pdftools , spanish	0	1379	July 15, 2021
Extracting text from a pdf with 2 columns General Q&A magick , tesseract , tabulizer	1	327	March 1, 2023
tabulizer for parsing block-text from .pdf UseCases package , tabulizer	1	1413	February 1, 2020
Using pdftools, tabulizer, and writexl to simplify business information handling workflow UseCases writexl , pdftools , tabulizer	3	2288	August 30, 2019

Extracting Text from Invoices document using Bound box and paste in Excel

Related topics