Here is a quick way to have Omniscope extract text from PDF documents by leveraging the Python block in the data workflow.
You just need to add a Python block to the workflow and copy and paste the following script.
# Extracts the text from every PDF file present in a folder.
# Adapt the script by setting the 'pdf_dir' param, that's it.
# The output has 1 document per row, with FileName and Text fields containing
# the PDF file path and the extracted text.
# Prerequisite: you need to install the PyPDF2 module.
# To install it run "pip install PyPDF2" from the command line. See the Python block instructions tab for more info.
# Note: PyPDF2 3.x renamed the calls used here to PdfReader, reader.pages and extract_text().
import glob
import pandas as pd
import PyPDF2

pdf_dir = "c:/Users/Antonio/Desktop/"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)
rows = []
for file in pdf_files:
    with open(file, 'rb') as pdfFileObj:  # 'rb' for read binary mode
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        text = ''
        for pageNumber in range(pdfReader.numPages):
            text += pdfReader.getPage(pageNumber).extractText()
    # Drop newline characters, then split the text into a list of words
    words = text.replace('\n', '').split()
    rows.append({'FileName': file, 'Text': words})
# Build the output once, so that it does not start with an empty row
output_data = pd.DataFrame(rows, columns=['FileName', 'Text'])
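Newer PyPDF2 releases (3.x) no longer accept the PdfFileReader, getPage and extractText names used in the script. Below is a minimal sketch of the per-file extraction with the renamed calls, assuming PyPDF2 2.x or later is installed. The helper names `pages_to_text` and `read_pdf_text` and the optional `password` argument are illustrative and not part of the original script.

```python
def pages_to_text(pages):
    """Concatenate the text of each page object, in page order."""
    return "".join(page.extract_text() or "" for page in pages)


def read_pdf_text(path, password=None):
    """Return the text of one PDF file (hypothetical helper)."""
    # Imported here so the module still loads where PyPDF2 is missing.
    from PyPDF2 import PdfReader  # assumption: PyPDF2 2.x or later

    reader = PdfReader(path)
    # Only try to decrypt when the file is encrypted and a password was given.
    if reader.is_encrypted and password is not None:
        reader.decrypt(password)
    return pages_to_text(reader.pages)
```

Inside the loop over pdf_files, a call such as `text = read_pdf_text(file)` would then stand in for the page-by-page reading.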
The text will then be available as comma-separated words, depending on the PDF structure (this cannot be guaranteed, as it is behaviour specific to the Python library). You can use the Preparation blocks, e.g. search/replace or detokenise, to clean up the data and extract and use the information you need.
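If you would rather tidy the words in Python before they reach the Preparation blocks, something along these lines could work. It is only a sketch: the function name `clean_words` and the choice to lowercase words and strip surrounding punctuation are assumptions, not part of the original script.

```python
import string


def clean_words(words):
    """Lowercase each word, strip surrounding punctuation, drop empties."""
    cleaned = []
    for word in words:
        token = word.strip(string.punctuation).lower()
        if token:
            cleaned.append(token)
    return cleaned


# Join the cleaned words into one space-separated string per document.
sample = ["Hello,", "World!", "--", "PDF"]
print(" ".join(clean_words(sample)))  # prints: hello world pdf
```

Storing a single joined string in the Text field, rather than a list, may also make the output easier to handle downstream.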
Antonio Poggi posted over 5 years ago Admin Best Answer
After looking into this, I have realised that R offers a better package and a more powerful set of tools to extract the text. PDF files can use different encodings and encryption algorithms, some of which are not supported by the Python library mentioned above. The R "pdftools" package is both more powerful and simpler to use.
For instance, to extract the text of a PDF, you just need to add an R block in Omniscope and paste this code:
library(pdftools)
text <- pdf_text('C:/Users/Antonio/Desktop/sample.pdf', upw="somePassword")
output.data <- text
(The "upw" (user password) param is optional; you can omit it if not needed.)
Omniscope will automatically install the R packages required to execute the script.
You can then export the output as a text file with the File Output block, or use the dataset directly in your workspace.
The attached IOZ file contains the script and a workflow to extract words.
I have added a demo project showing how you can analyse the frequency of words in an online PDF document: https://omniscope.me/Forums/Extract+Text+from+PDF.iox/
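For readers who want to see the idea in code, counting word frequencies can be sketched in Python as follows. This is not taken from the demo project: `word_frequencies` is a hypothetical helper name, the counting is assumed to be case-insensitive, and the sample words are made up for illustration.

```python
from collections import Counter


def word_frequencies(words, top_n=None):
    """Return (word, count) pairs, most frequent first, ignoring case."""
    counts = Counter(word.lower() for word in words)
    return counts.most_common(top_n)


sample = ["PDF", "text", "pdf", "Text", "pdf", "block"]
for word, count in word_frequencies(sample, top_n=2):
    print(word, count)
```

With `top_n` left as None, every distinct word is returned with its count.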
Please share your thoughts or questions.
Attachments (1)
R PDF Reader.ioz
142 KB
Antonio Poggi posted over 5 years ago Admin
For your convenience, the project is attached as an IOZ file at the bottom of this post.
Attachments (1)
Extract Text....ioz
144 KB
Antonio Poggi posted almost 5 years ago Admin
This is now available as a Community custom block in Omniscope Evo.