Extracting text from a PDF with 2 columns

Hello,

I was wondering whether anyone in this forum has an idea or suggestion for solving the following issue:

I am working with PDF documents that contain text in a two-column format (sometimes interrupted by headings). See here for an example. To extract the text column-wise I use the tabulizer package and its extract_text function. This generally works well, but on some documents/pages it fails to recognize the separation into two columns, so the text is extracted incorrectly.

my_link <- "https://www.parlament.gv.at/dokument/V/NRSITZ/116/imfname_141323.pdf"

tabulizer::extract_tables(
  file = my_link,
  pages = 3,
  columns = list(3),
  guess = FALSE,
  method = "stream",
  encoding = "UTF-8"
)

Line 1 of page 3 of the sample document reads as:

[3,] “Der Verkehrsausschuß hat sich mitd er Regie­ Antrag der Abg. Widmayer , Cerny, Hintern­”

The first line of the left column and the first line of the right column end up in one string, so the text is extracted without respecting the column break.

I also tried tesseract and tabulizer’s extract_tables, but unfortunately without success.

I am not sure how to proceed now, but I was wondering whether there is a viable way to use the line in the middle that separates the columns as some sort of indicator.
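One rough idea along these lines, as an untested sketch: if the dividing line sits roughly at the horizontal centre of the page (an assumption, not something I have measured), each half of the page could be passed to extract_text as its own area. The helper names split_page_areas and extract_two_columns are made up for illustration; get_page_dims and the area argument are tabulizer's.

```r
# Build tabulizer-style areas, c(top, left, bottom, right) in points, for the
# left and right half of a page. Assumption: the divider is at width / 2.
split_page_areas <- function(width, height) {
  mid <- width / 2
  list(
    left  = c(0, 0, height, mid),
    right = c(0, mid, height, width)
  )
}

# Hypothetical wrapper: extract the left half, then the right half, of one page.
extract_two_columns <- function(file, page) {
  dims  <- tabulizer::get_page_dims(file, pages = page)[[1]]  # c(width, height)
  areas <- split_page_areas(dims[1], dims[2])
  left  <- tabulizer::extract_text(file, pages = page, area = list(areas$left))
  right <- tabulizer::extract_text(file, pages = page, area = list(areas$right))
  paste(left, right, sep = "\n")
}
```

Something like extract_two_columns(my_link, 3) would then return the left column followed by the right one. Headings that span both columns would be cut in half by this split, so it could only be a starting point.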

Would be grateful for any hint. Many thanks.
ro

Think I just found a solution: Tesseract’s

options = list(tessedit_pageseg_mode=1)

did the trick.
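
For anyone landing here later, a minimal sketch of how that option might be wired up. It assumes the pdftools and tesseract packages are installed and that German ("deu") language data is available to Tesseract; the language, the dpi value, and the helper names psm_options and ocr_page are my own choices for illustration. Page segmentation mode 1 is Tesseract's automatic page segmentation with orientation and script detection.

```r
# Build the Tesseract options list for a given page segmentation mode (0-13).
psm_options <- function(psm = 1) {
  stopifnot(psm %in% 0:13)
  list(tessedit_pageseg_mode = psm)
}

# Hypothetical helper: render one page of a local PDF to PNG and run OCR on it.
# Assumption: "deu" training data is installed (see tesseract::tesseract_download).
ocr_page <- function(pdf, page, psm = 1, language = "deu", dpi = 300) {
  png <- pdftools::pdf_convert(
    pdf,
    format = "png",
    pages = page,
    dpi = dpi,
    filenames = tempfile(fileext = ".png")
  )
  engine <- tesseract::tesseract(language = language, options = psm_options(psm))
  tesseract::ocr(png, engine = engine)
}

# Example (not run): download the sample PDF once, then OCR page 3.
# pdf <- tempfile(fileext = ".pdf")
# download.file(my_link, pdf, mode = "wb")
# cat(ocr_page(pdf, page = 3))
```

Rendering at a higher dpi tends to help OCR on scanned pages, but 300 is only a guess here and may need adjusting.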
