PDF

Find and fix non-searchable PDFs

I use a ScanSnap iX500 scanner to scan a lot of paper into PDFs on my iMac. And thanks to the ScanSnap's bundled optical character recognition (OCR), all of those scans are searchable via Spotlight. While the OCR may not be perfect, it's generally more than good enough to find what I'm looking for.

However, some tests with Spotlight showed that a number of my PDFs weren't searchable: some electronic statements from credit cards and utility companies, and some older documents that predated my purchase of the ScanSnap.

But I wanted to know how many such PDFs I had, so I could run OCR on all of them, via the excellent PDFPen Pro app. (The Fujitsu's software will only perform OCR on documents it scanned.) The question was how to find all such files, and then once found, how to most easily run them through PDFPen Pro's OCR process.

In the end, I needed to install one set of Unix tools, and then write two small scripts—one shell script and one AppleScript. Of course, you'll also need PDFPen (I don't think Pro is required), or some other app that can perform OCR on PDF files.




Hardware: Fujitsu ScanSnap iX500 document scanner

In mid-2015, I decided I wanted to get rid of the mass of paper we'd been accumulating for years. Much of it could be recycled, but there was still a substantial stack of important yet rarely looked at paper that we needed to keep. If anything was ripe for a digitization project, it was this stack of paper. But there were thousands of pages to scan, and that's not something you're going to want to do on your $99 all-in-one printer/scanner/coffee maker.

After talking with some people and reading some reviews, I bought a Fujitsu ScanSnap iX500 document scanner. This was not an inexpensive purchase—it lists for nearly $500, though typically sells for just over $400.

Note that there are two versions of this scanner: The PA03656-B005, which is what I have, and the newer PA03656-B305. The newer one is actually less expensive ($415 vs $490 as I write this), and apparently the sole difference is the bundled third-party software. I haven't seen the newer scanner's bundle, though, so I can't comment.

I've been using this scanner pretty much every day since October of 2015, and I can say it's one of the best pieces of hardware I've ever purchased. (The software is also very good, but the UI is far from lovely.) So far, I've scanned over 8,500 pages with this scanner, and I haven't had any issues with it at all. If you're interested in document scanning, read on for my thoughts on why this Fujitsu is an excellent tool for the task…




Count pages in all PDFs within a folder structure

Please see this newer post, with a new script that provides subtotals by subfolder, which is what I really wanted when I wrote this one.

Recently I've been trying to go paperless (well, mostly paperless) via a Fujitsu ScanSnap iX500. (I'll have more to say about the scanner in a future post.)

One way to go paperless is to just go from now forward—start scanning stuff and don't worry about history. I decided that I'd go the other route, and work through our old paper files: some would be scanned and kept, much would just be recycled. The process went really quickly, compared to what I had expected. It helps that the Fujitsu is a wicked-fast document scanner!

But I was curious about how much I was scanning, in terms of total PDF pages—not files, but counting the pages in the files. Spotlight to the rescue; the field kMDItemNumberOfPages returns the number of pages in a document, and it seemed accurate in testing via mdls:

$ mdls /path/to/somefile.pdf | grep kMDItemNumberOfPages
kMDItemNumberOfPages = 4

So I set out to write a script to traverse my "Scans" folder, and return the total number of PDF pages.
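
As a rough idea of the approach, here is a minimal sketch. It is not necessarily the script from the full post. The function name count_pdf_pages is made up for illustration, and the sketch assumes macOS (for mdls), a Spotlight index that already covers the folder, and file names without embedded newlines. It asks mdls for just the one attribute with -name and -raw, then lets awk add up the numeric answers and ignore anything else (such as the "(null)" that mdls prints when a file has no page count).

```shell
# Hypothetical helper: print the total page count of all PDFs under a folder.
# Assumes macOS mdls and an up-to-date Spotlight index for that folder.
count_pdf_pages() {
  find "$1" -type f -iname '*.pdf' | while IFS= read -r pdf; do
    # -raw prints only the value, with no trailing newline, so add one.
    mdls -name kMDItemNumberOfPages -raw "$pdf" < /dev/null
    echo
  done | awk '/^[0-9]+$/ { total += $1 } END { print total + 0 }'
}
```

With the function loaded in a shell, something like count_pdf_pages ~/Scans (the folder location is an assumption) would print a single number. Files that Spotlight hasn't indexed contribute nothing to the total, so the result can undercount.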
