Extracting text from a pdf with 2 columns

zoowalk · March 1, 2023, 11:45am

Hello,

Does anyone here have an idea or suggestion for how to solve the following issue?

I am working with PDF documents that contain text in a two-column format (sometimes interrupted by headings). See here for an example. To extract the text column-wise I use the tabulizer package and its extract_text function. This generally works well, but on some documents/pages it fails to recognize the separation into two columns, so the text is extracted incorrectly.

my_link <- "https://www.parlament.gv.at/dokument/V/NRSITZ/116/imfname_141323.pdf"

tabulizer::extract_tables(
  file = my_link,
  pages = c(3),
  columns = list(3),
  guess = FALSE,
  method = "stream",
  encoding = "UTF-8"
)

Line 1 of page 3 of the sample document reads as:

[3,] “Der Verkehrsausschuß hat sich mitd er Regie Antrag der Abg. Widmayer , Cerny, Hintern”

The text is extracted without respecting the column break.

I also tried tesseract and tabulizer’s extract_tables, but unfortunately without success.

I am not sure how to proceed. Is there a viable way to use the line in the middle of the page, which separates the columns, as an indicator of the column break?

Would be grateful for any hint. Many thanks.
ro

zoowalk · March 1, 2023, 12:32pm

I think I just found a solution. Setting Tesseract’s page segmentation mode via

options = list(tessedit_pageseg_mode=1)

did the trick.
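
For anyone landing here later, this is a minimal sketch of how that option might be wired up in R. It is not the exact code used above. It assumes the pdftools and tesseract packages are installed and that the German language data has been downloaded once with tesseract::tesseract_download("deu"). The helper name ocr_two_column_page, the "deu" language setting and the 300 dpi resolution are illustrative assumptions. Page segmentation mode 1 asks Tesseract for automatic page segmentation with orientation and script detection, which is presumably why it picks up the column layout here.

```r
# Hypothetical helper: OCR a single PDF page with Tesseract's page segmentation mode 1.
ocr_two_column_page <- function(pdf, page, language = "deu", dpi = 300) {
  # Render the requested page to a temporary PNG; pdf_convert() returns the file name.
  png_file <- pdftools::pdf_convert(
    pdf,
    format = "png",
    pages = page,
    dpi = dpi,
    filenames = tempfile(fileext = ".png")
  )
  on.exit(unlink(png_file), add = TRUE)

  # Build an engine with the option from the post above
  # (mode 1 = automatic page segmentation with OSD).
  engine <- tesseract::tesseract(
    language = language,
    options = list(tessedit_pageseg_mode = 1)
  )

  # Returns the recognized text of the page as a character string.
  tesseract::ocr(png_file, engine = engine)
}

# Usage sketch (needs network access and the packages above, so left commented out):
# pdf_file <- tempfile(fileext = ".pdf")
# download.file(my_link, pdf_file, mode = "wb")
# cat(ocr_two_column_page(pdf_file, page = 3))
```

Results will likely depend on the scan quality and the rendering resolution, so it may be worth experimenting with dpi if columns are still merged.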
