Extract text from PDF files

Here is a quick way to have Omniscope extract text from PDF documents by using the Python block in the data workflow.


You just need to add a Python block to the workflow and paste in the following script.


# This script extracts the text from all PDF files present in a folder.
# You just need to set the 'pdf_dir' param to your own folder, that's it.
# When you execute it you will get 1 document per row, with the FileName and Text fields
# containing the PDF file path and the extracted text.
# Prerequisite: you need to install the PyPDF2 module.
# To install it, run "pip install PyPDF2" from the command line. See the Python block instructions tab for more info.


import glob
import os

import pandas as pd
import PyPDF2

pdf_dir = "c:/Users/Antonio/Desktop/"
pdf_files = glob.glob(os.path.join(pdf_dir, "*.pdf"))

rows = []  # one entry per PDF, turned into a DataFrame at the end

for file in pdf_files:

  # 'rb' opens the file in read binary mode; 'with' closes it when done
  with open(file, 'rb') as pdfFileObj:
    # PdfReader, .pages and extract_text() are the PyPDF2 2.x/3.x names for
    # the older PdfFileReader, numPages/getPage and extractText()
    pdfReader = PyPDF2.PdfReader(pdfFileObj)
    text = ''
    for pageObj in pdfReader.pages:
      text += (pageObj.extract_text() or '') + ' '

  # Split on any whitespace (line breaks included) to get a list of words
  rows.append({'FileName': file, 'Text': text.split()})

# Build the output in one go, so there is no empty first row
output_data = pd.DataFrame(rows, columns=['FileName', 'Text'])




Depending on the PDF structure, the text will then be available as comma-separated words, although this cannot be guaranteed because it depends on the behaviour of the Python library. You can then use the Preparation blocks (e.g. search/replace, detokenise) to clean up the data and extract the information you need.
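
If you would rather tidy the words inside the Python block before they reach the Preparation blocks, something like the following could work. It is only a sketch: clean_words is a hypothetical helper name, and the rules it applies (lower-casing, stripping surrounding punctuation, dropping empty tokens) are assumptions about what "clean" means for your documents.

```python
import string


def clean_words(words):
    """Lower-case each word, strip surrounding punctuation, drop empties."""
    cleaned = []
    for word in words:
        token = word.strip(string.punctuation).lower()
        if token:  # tokens made only of punctuation become '' and are skipped
            cleaned.append(token)
    return cleaned


# Hypothetical sample standing in for the word list of one extracted PDF
sample = ["Hello,", "World!", "--", "PDF", "(text)"]
print(" ".join(clean_words(sample)))  # prints: hello world pdf text
```

Joining the cleaned words with a space gives you one plain string per document, which may be easier to handle than a list of words.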



I have added a demo project showing how you can analyse the frequency of words in an online PDF document here: https://omniscope.me/Forums/Extract+Text+from+PDF.iox/
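
For a rough idea of what a word frequency count looks like in plain Python, here is a minimal sketch. It is not the demo project's workflow; word_frequencies is a hypothetical helper name, and it assumes you already have the extracted words as a list.

```python
from collections import Counter


def word_frequencies(words, top_n=5):
    """Return the top_n most common words as (word, count) pairs."""
    counts = Counter(word.lower() for word in words)
    return counts.most_common(top_n)


# Hypothetical sample standing in for the words extracted from one PDF
sample = "the cat and the hat and the bat".split()
for word, count in word_frequencies(sample, top_n=2):
    print(word, count)
# prints:
# the 3
# and 2
```

In Omniscope itself you may prefer to do the counting with the workflow blocks, as the demo project does.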


Please share your thoughts or questions.


Best Answer

After looking into this, I have realised that R offers a better package and a more powerful set of tools for extracting the text. PDF files can use different encodings and encryption algorithms, some of which are not supported by the aforementioned Python library.
The R "pdftools" package is both more powerful and simpler to use.


For instance, to extract the text of a PDF, you just need to add an R block in Omniscope and paste in this code:

 

library(pdftools)
text <- pdf_text('C:/Users/Antonio/Desktop/sample.pdf', upw="somePassword")
output.data <- text


(The "upw" (user password) param is optional; you can omit it if the PDF is not password-protected.)


Omniscope will automatically install the R packages required to execute the script.


You can then export the output as a text file with the File Output block, or use the dataset directly in your workspace.


Find the IOZ attached with the script and workflow to extract words


Find the project attached as IOZ for your convenience at the bottom of this post
