The Robservatory

Robservations on everything…

 

Find and fix non-searchable PDFs

I use a ScanSnap iX500 scanner to scan a lot of paper into PDFs on my iMac. And thanks to the ScanSnap’s bundled optical character recognition (OCR), all of those scans are searchable via Spotlight. While the OCR may not be perfect, it’s generally more than good enough to find what I’m looking for.

However, some tests with Spotlight showed that a number of my PDFs weren’t searchable: some electronic statements from credit cards and utility companies, and some older documents that predated my purchase of the ScanSnap.

But I wanted to know how many such PDFs I had, so I could run OCR on all of them, via the excellent PDFPen Pro app. (The Fujitsu’s software will only perform OCR on documents it scanned.) The question was how to find all such files, and then once found, how to most easily run them through PDFPen Pro’s OCR process.

In the end, I needed to install one set of Unix tools, and then write two small scripts: one shell script and one AppleScript. Of course, you’ll also need PDFPen (I don’t think Pro is required), or some other app that can perform OCR on PDF files.

The first challenge was identifying PDFs that weren’t searchable. My first thought was the files’ metadata, but comparing a non-searchable and searchable PDF revealed nothing usable. Then, on Twitter, Michael Wood had a suggestion:

hmmm… maybe you could cobble something together with “pdftotext” to see if it contains text.

The pdftotext tool he mentions is an open source tool that can extract text data from PDF files. In theory, I could check the extracted text of a PDF—if there wasn’t any, then it wasn’t searchable. But there were some issues with that, because it can sometimes return text from images, for instance.

But pdftotext is also available as part of XpdfReader, a bigger package of PDF tools. And XpdfReader is available within Homebrew, which is how I install most Unix tools nowadays. So I installed the full set, via brew install xpdf, and took a look at what was available.
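For comparison, the pdftotext idea could be sketched roughly like this. It is only an illustration, assuming pdftotext is on the path; the function name has_text is made up for the example. Passing - as the output file sends the extracted text to standard output, and the check looks for any non-whitespace character, since pdftotext can emit page-break characters even when a page has no text:

```shell
#!/bin/bash
# Sketch: succeed if pdftotext extracts any non-whitespace text from a PDF.
# has_text is a hypothetical helper name, not part of any tool.
has_text() {
  # "-" as the output file writes the extracted text to standard output.
  # Form feeds between pages count as whitespace, so they are ignored.
  pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Report on the file given as the first argument, if there is one.
if [[ -n "${1:-}" ]]; then
  if has_text "$1"; then
    echo "text found: $1"
  else
    echo "no text found: $1"
  fi
fi
```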

After doing some testing, it turned out that pdffonts was the best tool for the job. This little program reports on the fonts within a PDF. If there are no fonts, that means it’s not a searchable PDF. When there are no fonts in the PDF, the last two lines of output from pdffonts look like this:

name                                 type              emb sub uni prob object ID
------------------------------------ ----------------- --- --- --- ---- ---------

To find all my unsearchable PDFs, I just had to check the last line of pdffonts output to see if it was all dashes. Using my revised PDF page counting script as a starting point, and as before with some optimization help from James, here’s the final script:
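A minimal sketch of such a script follows. It illustrates the approach described above and is not the original script; it assumes pdffonts is on the path, and the function name has_no_fonts is made up for the example. A PDF is treated as unsearchable when the last line of pdffonts output starts with a dash and contains only dashes and spaces. Files that pdffonts cannot read produce no output and are skipped.

```shell
#!/bin/bash
# pdfnosearch (sketch): list PDFs under the current folder that contain no fonts.
# Assumes pdffonts (installed by `brew install xpdf`) is on the PATH.

# Succeed if the last line of pdffonts output is the dashed separator row,
# meaning the header was printed but no fonts were listed beneath it.
has_no_fonts() {
  local last
  local dashes='^-[- ]*$'
  last=$(pdffonts "$1" 2>/dev/null | tail -n 1)
  [[ "$last" =~ $dashes ]]
}

# Walk the tree with NUL-delimited names so paths containing spaces survive.
# Starting from "$PWD" makes find print full paths.
find "$PWD" -type f -iname '*.pdf' -print0 |
while IFS= read -r -d '' pdf; do
  if has_no_fonts "$pdf"; then
    printf '%s\n' "$pdf"
  fi
done
```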

Save this somewhere on your path, make it executable, then run it in whatever folder you wish to search. The output should be a list of files, with full paths, that are not searchable:

$ pdfnosearch
/path/to/___Credit Cards/Statements/zPrior Years/2016-11 - CC Statement.pdf
/path/to/__Monthly Bills/Frontier FIOS/zPrior Years/2014-12 - Frontier FIOS.pdf
/path/to/__Monthly Bills/Frontier FIOS/zPrior Years/2015-01 - Frontier FIOS.pdf
/path/to/__Monthly Bills/Frontier FIOS/zPrior Years/2015-02 - Frontier FIOS.pdf
/path/to/__Monthly Bills/Frontier FIOS/zPrior Years/2015-03 - Frontier FIOS.pdf
/path/to/__Monthly Bills/NW Natural/zPrior Years/2017-07 - Northwest Natural.pdf
/path/to/__Monthly Bills/NW Natural/zPrior Years/2017-08 - Northwest Natural.pdf
/path/to/__Monthly Bills/NW Natural/zPrior Years/2017-09 - Northwest Natural.pdf
/path/to/__Monthly Bills/NW Natural/zPrior Years/2017-10 - Northwest Natural.pdf
etc...
$

Once I had my list of unsearchable files, I wanted a relatively easy way to batch convert them to searchable PDFs. I didn’t want to bother trying to automate this step (i.e. find then convert), because I’d spend way more time trying to get that working right than I’d save by having written it.

PDFPen Pro doesn’t have a batch function, nor a command line interface. But what it does have is AppleScript support, so I started with this AppleScript I found on the web. I then greatly simplified it into a little droplet that runs PDFPen’s OCR on any PDFs dropped onto it:

Save the script as an application, and you can drag and drop PDFs onto its icon; PDFPen Pro will then open and run OCR on each file.

Using these two scripts, I was able to find and fix about 100 PDFs with a minimal amount of work. I still have a few troublesome PDFs that claim to be searchable (which means they have a text layer, so PDFPen Pro won’t OCR them), but there’s no text to actually be found. So I guess it’s on to the next challenge…

1 Comment

  1. I added (you may wish to substitute) the following beneath the “printf” line (line 13).

    open -a /path/to/PDFPenOCR.app "$i"

    where “/path/to/PDFPenOCR.app” is the path to your PDFPen AppleScript.

    Here’s a simple AppleScript droplet source for making these sorts of shell scripts executable.

    on open the_items
        my execabilify(the_items)
    end open

    on execabilify(fl)
        repeat with i in fl
            try
                do shell script "chmod " & "755" & space & quoted form of POSIX path of i with administrator privileges
            on error the error_message number the error_number
                display dialog "Error: " & the error_number & ". " & the error_message buttons {"Cancel"} default button 1
            end try
        end repeat
    end execabilify


The Robservatory © 2017 Built from the Frontier theme