Find and fix non-searchable PDFs

I use a ScanSnap ix500 scanner to scan a lot of paper into PDFs on my iMac. And thanks to the ScanSnap's bundled optical character recognition (OCR), all of those scans are searchable via Spotlight. While the OCR may not be perfect, it's generally more than good enough to find what I'm looking for.

However, some tests with Spotlight showed that a number of my PDFs weren't searchable: some electronic statements from credit cards and utility companies, and some older documents that predated my purchase of the ScanSnap.

But I wanted to know how many such PDFs I had, so I could run OCR on all of them with the excellent PDFPen Pro app. (Fujitsu's software will only perform OCR on documents it scanned.) The question was how to find all such files and, once found, how to most easily run them through PDFPen Pro's OCR process.

In the end, I needed to install one set of Unix tools, and then write two small scripts—one shell script and one AppleScript. Of course, you'll also need PDFPen (I don't think Pro is required), or some other app that can perform OCR on PDF files.

The first challenge was identifying PDFs that weren't searchable. My first thought was the files' metadata, but comparing a non-searchable and searchable PDF revealed nothing usable. Then, on Twitter, Michael Wood had a suggestion:

hmmm... maybe you could cobble something together with "pdftotext" to see if it contains text.

The pdftotext tool he mentions is an open source tool that can extract text data from PDF files. In theory, I could check the extracted text of a PDF—if there wasn't any, then it wasn't searchable. But there were some issues with that, because it can sometimes return text from images, for instance.
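
As a rough sketch of that idea (not what this post ends up relying on): pdftotext writes to standard output when the output file is given as -, so a small helper can test whether any non-whitespace text comes back. The function name pdf_has_text and the example file name are made up for illustration, and as noted above, a "has text" result doesn't guarantee the file is usefully searchable.

```shell
#!/bin/bash

# Succeeds (exit status 0) if pdftotext extracts at least one
# non-whitespace character from the PDF given as $1; fails otherwise.
# The "-" output argument tells pdftotext to write to standard output.
pdf_has_text() {
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Example use (hypothetical file name):
#   if pdf_has_text "statement.pdf"; then echo "some text found"; fi
```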

But pdftotext is also available as part of XpdfReader, a bigger package of PDF tools. And XpdfReader is available within Homebrew, which is how I install most Unix tools nowadays. So I installed the full set, via brew install xpdf, and took a look at what was available.
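
After installing, it can be worth confirming that the tools actually landed on your path. Here is a minimal sketch; missing_tools is a hypothetical helper name, and the tool list in the example is just the two commands mentioned in this post.

```shell
#!/bin/bash

# Prints the name of each listed command that cannot be found on the PATH.
# Returns non-zero if at least one is missing.
missing_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example use:
#   missing_tools pdftotext pdffonts || echo "install them with: brew install xpdf"
```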

After doing some testing, it turned out that pdffonts was the best tool for the job. This little program reports on the fonts within a PDF. If there are no fonts, that means it's not a searchable PDF. When there are no fonts in the PDF, the last two lines of output from pdffonts looked like this:

name                                 type              emb sub uni prob object ID
------------------------------------ ----------------- --- --- --- ---- ---------
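
A header-only result like this can be detected by looking at just the last line of output. The sketch below is one way to phrase that test; the function names are hypothetical, and the pattern assumes the rule line consists only of dashes and spaces, as in the sample above.

```shell
#!/bin/bash

# Reads pdffonts output on standard input. Succeeds if the last line is
# made up only of dashes and spaces, meaning no font rows follow the header.
ends_with_header_rule() {
    tail -n 1 | grep -q '^-[- ]*$'
}

# Succeeds if pdffonts lists no fonts for the PDF given as $1.
# Error messages are merged into the output on purpose: they do not look
# like a rule line, so files pdffonts cannot read are not reported.
pdf_has_no_fonts() {
    pdffonts "$1" 2>&1 | ends_with_header_rule
}
```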

To find all my unsearchable PDFs, I just had to check the last line of pdffonts output to see if it was all dashes. Using my revised PDF page counting script as a starting point, and as before with some optimization help from James, here's the final script:

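#!/bin/bash

saveIFS=$IFS
IFS=$(echo -en "\n\b")   # split on newlines, not spaces, so paths may contain spaces

FilesToCheck=$(find "$PWD" -maxdepth 99 -name "*.pdf")

for i in $FilesToCheck
do
   errCheck=$(pdffonts "${i}" 2>&1 | tail -n 1)   # last line of the font report
   if [[ $errCheck =~ ^- ]]                       # starts with a dash: no fonts listed
   then
       printf '%s\n' "$i"
   fi
done

IFS=$saveIFS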
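
An alternative sketch, not the script above: if you'd rather not change IFS at all, find can hand the file names straight to a small inline shell, which avoids word splitting entirely. The function name list_unsearchable_pdfs is hypothetical; the check itself is the same "last line starts with a dash" test.

```shell
#!/bin/bash

# Prints the path of every *.pdf file under the directory given as $1
# (default: the current directory) whose pdffonts report ends with the
# dashed header rule, i.e. lists no fonts.
list_unsearchable_pdfs() {
    # "-exec sh -c '...' sh {} +" passes batches of file names to the
    # inline script as positional parameters, so spaces in names are safe.
    find "${1:-$PWD}" -type f -name '*.pdf' -exec sh -c '
        for file do
            case $(pdffonts "$file" 2>&1 | tail -n 1) in
                -*) printf "%s\n" "$file" ;;
            esac
        done
    ' sh {} +
}

# Example use:
#   list_unsearchable_pdfs ~/Documents
```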

Save this somewhere on your path (I called mine pdfnosearch), make it executable, then run it in whatever folder you wish to search. The output should be a list of files, with full paths, that are not searchable:

$ pdfnosearch
/path/to/___Credit Cards/Statements/zPrior Years/2016-11 - CC Statement.pdf
/path/to/__Monthly Bills/Frontier FIOS/zPrior Years/2014-12 - Frontier FIOS.pdf
/path/to/__Monthly Bills/Frontier FIOS/zPrior Years/2015-01 - Frontier FIOS.pdf
/path/to/__Monthly Bills/Frontier FIOS/zPrior Years/2015-02 - Frontier FIOS.pdf
/path/to/__Monthly Bills/Frontier FIOS/zPrior Years/2015-03 - Frontier FIOS.pdf
/path/to/__Monthly Bills/NW Natural/zPrior Years/2017-07 - Northwest Natural.pdf
/path/to/__Monthly Bills/NW Natural/zPrior Years/2017-08 - Northwest Natural.pdf
/path/to/__Monthly Bills/NW Natural/zPrior Years/2017-09 - Northwest Natural.pdf
/path/to/__Monthly Bills/NW Natural/zPrior Years/2017-10 - Northwest Natural.pdf
etc...
$

Once I had my list of unsearchable files, I wanted a relatively easy way to batch convert them to searchable PDFs. I didn't want to bother trying to automate this step (i.e. find, then convert), because I'd spend way more time trying to get that working right than I'd save by having it written.

PDFPen Pro doesn't have a batch function, nor a command line interface. What it does have is AppleScript support, so I started with an AppleScript I found on the web, then pared it down to a simple little droplet that runs PDFPen's OCR on any PDFs dropped onto the script:

on open droppedItems
    repeat with theFile in droppedItems
        tell application "PDFpenPro"
            open theFile as alias
            tell document 1
                ocr
                repeat while performing ocr
                    delay 1
                end repeat
                delay 1
                close with saving
            end tell
        end tell
    end repeat
end open

Save the script as an application, and you can drag and drop PDFs onto its icon; PDFPen Pro will then open and run OCR on each file.
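
To avoid dragging files by hand, the list from the first script could be passed to the droplet with macOS's open -a. The following is a sketch under assumptions: send_to_app is a hypothetical helper, /path/to/PDFPenOCR.app is a placeholder for wherever you saved the droplet, and how well the droplet copes with a large batch in one go is untested here.

```shell
#!/bin/bash

# Reads one file path per line on standard input and passes all of them,
# in a single call, to the application named in $1 via "open -a".
# Blank lines are skipped; nothing is opened if no paths were read.
send_to_app() {
    app=$1
    set --                      # reuse the positional parameters as a list
    while IFS= read -r file; do
        if [ -n "$file" ]; then
            set -- "$@" "$file"
        fi
    done
    if [ "$#" -gt 0 ]; then
        open -a "$app" "$@"
    fi
}

# Example use (placeholder droplet path):
#   pdfnosearch | send_to_app "/path/to/PDFPenOCR.app"
```

Sending everything in one open call is meant to resemble dropping a whole selection onto the droplet at once, so the files arrive as a single list.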

Using these two scripts, I was able to find and fix about 100 PDFs with a minimal amount of work. I still have a few troublesome PDFs that claim to be searchable (which means they have a text layer, so PDFPen Pro won't OCR them), but there's no text to actually be found. So I guess it's on to the next challenge…

1 thought on “Find and fix non-searchable PDFs”

prehensileblog Apr 12 '18 at 12:31 pm
I added (you may wish to substitute) the following beneath the "printf" line (line 13).
open -a /path/to/PDFPenOCR.app $i
where "/path/to/PDFPenOCR.app" is the path to your PDFPen AppleScript.
Here's a simple AppleScript droplet source for making these sorts of shell scripts executable.
on open the_items
    my execabilify(the_items)
end open
on execabilify(fl)
    repeat with i in fl
        try
            do shell script "chmod " & "755" & space & quoted form of POSIX path of i with administrator privileges
        on error the error_message number the error_number
            display dialog "Error: " & the error_number & ". " & the error_message buttons {"Cancel"} default button 1
        end try
    end repeat
end execabilify
