
shell script

Find and fix non-searchable PDFs

I use a ScanSnap ix500 scanner to scan a lot of paper into PDFs on my iMac. And thanks to the ScanSnap's bundled optical character recognition (OCR), all of those scans are searchable via Spotlight. While the OCR may not be perfect, it's generally more than good enough to find what I'm looking for.

However, based on some tests with Spotlight, I noticed that I had a number of PDFs that weren't searchable—some electronic statements from credit cards and utility companies, and some older documents that predated my purchase of the ScanSnap.

But I wanted to know how many such PDFs I had, so I could run OCR on all of them via the excellent PDFPen Pro app. (The Fujitsu's software will only perform OCR on documents it scanned.) The question was how to find all such files and then, once found, how to most easily run them through PDFPen Pro's OCR process.

In the end, I needed to install one set of Unix tools, and then write two small scripts—one shell script and one AppleScript. Of course, you'll also need PDFPen (I don't think Pro is required), or some other app that can perform OCR on PDF files.
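
As a rough illustration of the finding half, here's a minimal sketch of one way to detect PDFs with no text layer. This assumes the poppler tools (installed via Homebrew, e.g. brew install poppler) rather than whatever the full post settles on; it simply flags any PDF from which pdftotext can't extract any text:

#!/bin/bash
# Sketch only: list PDFs that appear to have no searchable text layer.
find . -iname '*.pdf' -print0 | while IFS= read -r -d '' pdf; do
  if ! pdftotext "$pdf" - 2>/dev/null | grep -q '[[:alnum:]]'; then
    echo "No text layer: $pdf"
  fi
done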

[continue reading…]



Revisiting a PDF page counting script

A couple of years back, I created a bash script to count PDF pages across subfolders; I run it, for example, on my folder of Apple manuals.

I use this script on the top-level folder where I save all my Fujitsu ScanSnap iX500 scans. Why? Partly because I'm a geek, and partly because it helps me identify folders I might not need to keep on their own—if there are only a few pages in a folder, I'll generally try to consolidate its contents into another lightly used folder.

The script I originally wrote worked fine, and still works fine—sort of. When I first wrote about it, I said…

I feared this would be incredibly slow, but it only took about 40 seconds to traverse a folder structure with about a gigabyte of PDFs in about 1,500 files spread across 160 subfolders, and totalling 5,306 PDF pages.

That was then, this is now: With 12,173 pages of PDFs spread across 4,475 files in 295 folders, the script takes over two minutes to run—155 seconds, to be precise. That's not anywhere near acceptable, so I set out to see if I could improve my script's performance.

In the end, I succeeded—though it was more of a "we succeeded" thing, as my friend James (who uses a very similar scan-and-file setup) and I went back and forth with changes over a couple of days. The new script takes just over 10 seconds to count pages in the same set of files. (It's even more impressive if the files aren't so spread out—my eBooks/Manuals folder has over 12,000 pages, too, but in just 139 files in 43 folders…the script runs in just over a second.)

Where'd the speed boost come from? One simple change that seems obvious in hindsight, but that I was amazed actually worked…

[continue reading…]



Adjusting for the oddities of ctime

In the shell script I use to back up my web sites (I really should update that; they're much different now), I include a line that trims the backup folder of older compressed backups of the actual WordPress databases. That line used to look like this:

find path/to/sqlfiles/backups -ctime +5 -delete

I thought this should delete all backups in that folder that are at least five days old, via the ctime +5 bit. (I know now I should have been using mtime, though it would have had the same issue I had with ctime.) But it turns out I thought wrong: the above will delete all files that are at least six days old. Why? I don't know exactly, but it's mostly explained in the man page for find (my emphasis added):

-ctime n[smhdw] If no units are specified, this primary evaluates to true if the difference between the time of last change of file status information and the time find was started, rounded up to the next full 24-hour period, is n 24-hour periods. If units are specified, this primary evaluates to true if the difference between the time of last change of file status information and the time find was started is exactly n units. Please refer to the -atime primary description for information on supported time units.

To make find do what I wanted it to do, I just needed to change +5 to +5d. Simple enough…but while figuring this out, I stumbled across this page, which has an alternative solution with more flexibility:

find path/to/sqlfiles/backups -mmin +$((60*24*5)) -delete

The mmin parameter is much more precise than ctime:

-mmin n True if the difference between the file last modification time and the time find was started, rounded up to the next full minute, is n minutes.

By using mmin, I can be really precise. As shown, 60*24*5 gets me the same five-day interval as ctime +5d. (And yes, I could have used 7200 instead of 60*24*5, but I find it clearer to leave it in its expanded form.)

But I could instead delete backups that are older than 3.25 days (60*24*3.25, or 4680 minutes), or use any other arbitrary time period. I like the flexibility this offers over ctime, so I've switched my script over to this form.
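
One caveat, though: the shell's $(( )) arithmetic is integer-only, so for a fractional period like that you'd pre-compute the minutes yourself rather than putting 3.25 inside the expansion:

find path/to/sqlfiles/backups -mmin +4680 -delete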



Change shell scripts based on where they run

This is one of those "oh duh!" things that I wish I'd realized earlier. I have a few shell scripts that I'd like to keep on the Many Tricks cloud server, as I'd like to use them on multiple Macs.

But depending on which Mac is running the script, I might need to use unique code. The path to my Dropbox folder, for example, is different on my laptop and my iMac, so any reference to that path has to differ on each machine. I couldn't figure out how to make that happen with just one script, so I'd been using near-identical versions on each Mac.

Then I remembered the hostname command, which returns the name of the machine running the command:

$ hostname
Robs-rMBP.local

And that was the tidbit of "duh!" knowledge I needed. With that and a case statement, I can make my shell scripts run different code depending on which machine runs them. For instance, I can set unique paths for the script that grabs the latest versions of our apps from our server:

myhost=`hostname`
case $myhost in
  Robs-iMac.local) theHub=/path/to/apps/on/manytricks/cloud ;
                   theDest=/path/to/local/copy/of/apps ;;

  Robs-rMBP.local) theHub=/different/path/to/apps/on/manytricks/cloud ;
                   theDest=/other/path/to/local/copy/of/apps ;;

                *) echo "Sorry, unrecognized Mac." ;
                   exit ;;
esac

cp $theHub/$appname $theDest/$appname
etc

Another nice thing about this is the script won't run on a Mac I haven't set up yet, thanks to the *) bit. And if I happen to rename one of my Macs, the script will also fail to run, letting me know I need to update the name in the script.

A simple tip, but one I'd managed to overlook for years. Now that I've written it up, that shouldn't happen again.



Semi-automatic Homebrew and video-transcode updates

As I've written about in the past, I use Don Melton's video transcoding tools to rip Blu-Ray discs. I also use Homebrew to install some of those tools' dependencies, as well as other Unix tools.

Keeping these tools current isn't overly difficult; it only requires a few commands in Terminal:

$ brew update
$ brew upgrade
$ sudo gem update video_transcoding

My problem is that I often forget to do this, because—unlike most GUI Mac apps and the Mac App Store—there's no built-in "hey, there's an update!" system. Then, two months and many revisions later, I finally remember (usually when I see a tweet about a new version of something). So I thought I'd try to write my own simple update reminder.

I didn't really want a scheduled task, like a launchd agent—it's not like these tools need to stay current on a daily basis. (And one of them needs to run with admin privileges, which complicates things.) I just wanted something that would remind me if it'd been a while since I last checked for updates, and then install the updates if I wanted it to do so.

After mulling it over, I came up with a script that runs each time I open a Terminal window (which I do daily). That script looks at the date on a check file and, if that date is more than a week old, asks whether I'd like to check for updates. This is perfect for my needs: The reminder is automatic, but I can choose when to install the updates based on what I'm doing at the time. If it's been under a week since I last checked, nothing at all is different about my Terminal launch.
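
To give a sense of the mechanism before the full writeup, here's a bare-bones sketch of the idea (not the actual script from the full post), assuming a marker file named ~/.last_update_check:

#!/bin/bash
# Sketch only: offer to run the updates if the marker file is over a week old.
checkfile="$HOME/.last_update_check"
[ -f "$checkfile" ] || touch "$checkfile"

now=$(date +%s)
last=$(stat -f %m "$checkfile")        # file's modification time (BSD stat)
if [ $((now - last)) -gt $((60*60*24*7)) ]; then
  read -r -p "It's been over a week; check for updates now? (y/n) " answer
  if [ "$answer" = "y" ]; then
    brew update && brew upgrade
    sudo gem update video_transcoding
    touch "$checkfile"                 # reset the clock
  fi
fi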

Read on for the script and implementation details. (Note: This is not written for a Terminal beginner, as it assumes some knowledge about how the shell works in macOS.)

[continue reading…]



Total PDF pages in subfolders across folder structure

Last week, I wrote a script that ran through a folder structure and output the page count of every PDF in all folders and sub-folders, and also spit out a grand total.

While this worked well, what I really wanted was a script that just totaled PDF pages by sub-folder, without seeing all the file-by-file detail. After trying to retrofit the first script, I realized that was a waste of time, and started over from scratch.

The resulting script works just as I'd like it to, traversing a folder structure and showing PDF page counts by folder:

$ countpdfbydir
    47: ./_Legal
     2: ./_Medical-Dental
    15: ./_Medical-Dental/Kids
    11: ./_Medical-Dental/Marian
     2: ./_Medical-Dental/Rob
    35: ./_Personal Documents/Kids
    87: ./_Personal Documents/Marian
    28: ./_Personal Documents/Rob
    10: ./_Personal Documents/Rob/Golf
    12: ./_Personal Documents/Rob/Travel
-------------------------------------------------------------------
   249: Total PDF Pages

It took a few revisions, but I like this version; it even does some simplistic padding to keep the figures lined up in the output.
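
For a sense of the general shape, a stripped-down sketch of the approach (not the actual script; see the full post for that) could look like this, using Spotlight's kMDItemNumberOfPages attribute and printf for the padding:

#!/bin/bash
# Sketch only: per-folder PDF page totals via Spotlight metadata.
total=0
while IFS= read -r -d '' dir; do
  pages=0
  while IFS= read -r -d '' pdf; do
    n=$(mdls -name kMDItemNumberOfPages -raw "$pdf" 2>/dev/null)
    [ "$n" = "(null)" ] && n=0
    pages=$((pages + n))
  done < <(find "$dir" -maxdepth 1 -iname '*.pdf' -print0)
  if [ "$pages" -gt 0 ]; then
    printf '%6d: %s\n' "$pages" "$dir"      # pad the counts so they line up
  fi
  total=$((total + pages))
done < <(find . -type d -print0)
echo "-------------------------------------------------------------------"
printf '%6d: Total PDF Pages\n' "$total"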

[continue reading…]



Count pages in all PDFs within a folder structure

Please see this newer post, with a new script that provides subtotals by subfolder, which is what I really wanted when I wrote this one.

Recently I've been trying to go paperless (well, mostly paperless) via a Fujitsu ScanSnap iX500. (I'll have more to say about the scanner in a future post.)

One way to go paperless is to just go from now forward—start scanning stuff and don't worry about history. I decided that I'd go the other route, and work through our old paper files: some would be scanned and kept, much would just be recycled. The process went much more quickly than I had expected. It helps that the Fujitsu is a wicked-fast document scanner!

But I was curious about how much I was scanning, in terms of total PDF pages—not files, but counting the pages in the files. Spotlight to the rescue: the kMDItemNumberOfPages attribute holds the number of pages in a document, and it seemed accurate in testing via mdls:

$ mdls /path/to/somefile.pdf | grep kMDItemNumberOfPages
kMDItemNumberOfPages = 4

So I set out to write a script to traverse my "Scans" folder, and return the total number of PDF pages.
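
The heart of it is just that attribute summed over every PDF; one rough way to get the grand total (a sketch, not the script from the full post) is:

find . -iname '*.pdf' -exec mdls -name kMDItemNumberOfPages {} + |
  awk '/kMDItemNumberOfPages/ { sum += $3 } END { print sum, "total PDF pages" }'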

[continue reading…]



Encrypt files then backup to a cloud service via script

Most cloud services tell you that their data stores are safe, that your data is encrypted in transit and on their drives, that employees don't have access, etc. For the vast majority of the stuff I store in the cloud, this is more than good enough for me—the data isn't overly sensitive, and if someone were to hack their way in, all they'd get are a bunch of work and personal writing files and some family photos.

For other files—primarily financial and family related—those assurances just aren't enough for me. But I still want the flexibility and security that comes from having a copy of these files in the cloud. So what's a paranoid user to do to take advantage of the cloud, with added security, but with a minimum of hassle?

The solution I came up with involves using local encrypted disk images and a shell script. Using this script (and some means of scheduling it), you can automatically encrypt and back up whatever files you like to a cloud service.
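
The general pattern, in very rough form (this is a sketch, not the script from the full post; the paths, image size, volume name, and passphrase handling are all assumptions), is: create an encrypted sparse image once, mount it, copy the sensitive files into it, unmount it, and then push the image itself to a folder your cloud service syncs:

#!/bin/bash
# Sketch only: use an encrypted disk image as the cloud-bound container.
IMAGE="$HOME/Backups/secure.sparseimage"
VOLUME="/Volumes/SecureBackup"
CLOUD="$HOME/Dropbox/SecureBackups"     # any folder your cloud service syncs

# One-time setup: a 500 MB AES-256 encrypted sparse image.
# (A passphrase file keeps the sketch simple; the real thing would want
# something safer, such as pulling the passphrase from the keychain.)
if [ ! -f "$IMAGE" ]; then
  hdiutil create -size 500m -type SPARSE -fs HFS+ -encryption AES-256 \
    -volname SecureBackup -stdinpass "$IMAGE" < "$HOME/.backup_passphrase"
fi

# Mount, copy the sensitive files in, unmount, then copy the image to the cloud folder.
hdiutil attach -stdinpass "$IMAGE" < "$HOME/.backup_passphrase"
rsync -a "$HOME/Documents/Financial/" "$VOLUME/"
hdiutil detach "$VOLUME"
cp "$IMAGE" "$CLOUD/"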

[continue reading…]



My most-useful and least-used shell script

I have a large number of small shell scripts I've either written or collected over the years. Today I had the opportunity to use my favorite one—which is rare, as I only need it a couple times a year. But when I do need it, it's a wonderful little script.

It's also a very simple-minded script, as it does just one thing: it copies my public IP address to the clipboard and shows it in a pop-up message, as seen at right. OK, so that's two things, but they're very closely related.

Clearly this isn't something I need to do often, but when I do, the script changes this…

Switch to browser, open new tab, load the DynDNS check IP page, drag mouse to select IP address, press Command-C to copy, switch back to destination app, press Command-V to paste

…into this…

Press a key combo, wait about a second, then press Command-V

This is a big timesaver, obviously, and it makes the process about as easy as it could be.
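
For the curious, the guts of such a script are tiny; a minimal sketch (the lookup URL here is just one common choice, not necessarily the one I use) looks like this:

#!/bin/bash
# Sketch only: grab the public IP, copy it, and pop up a notification.
ip=$(curl -s http://checkip.dyndns.org/ | grep -Eo '[0-9]+(\.[0-9]+){3}' | head -1)
printf '%s' "$ip" | pbcopy
osascript -e "display notification \"$ip\" with title \"Public IP copied\""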

I originally wrote this up for Mac OS X Hints a few years back, but thought I'd post it here (given the changes at Macworld, I'm not sure how long the hints site may be around). I've also modified it a bit, as I no longer use growlnotify for the onscreen display of the copied IP address.

You can read the original how-to at hints, or below, where I've posted the updated version that no longer uses growlnotify.

[continue reading…]