The Robservatory

Find and fix non-searchable PDFs

I use a ScanSnap ix500 scanner to scan a lot of paper into PDFs on my iMac. And thanks to the ScanSnap’s bundled optical character recognition (OCR), all of those scans are searchable via Spotlight. While the OCR may not be perfect, it’s generally more than good enough to find what I’m looking for.

However, based on some tests with Spotlight, I noticed that I had a number of PDFs that weren't searchable: some electronic statements from credit cards and utility companies, and some older documents that predated my purchase of the ScanSnap.

But I wanted to know how many such PDFs I had, so I could run OCR on all of them, via the excellent PDFPen Pro app. (The Fujitsu’s software will only perform OCR on documents it scanned.) The question was how to find all such files, and then once found, how to most easily run them through PDFPen Pro’s OCR process.

In the end, I needed to install one set of Unix tools and write two small scripts—one shell script and one AppleScript. Of course, you'll also need PDFPen (I don't think the Pro version is required), or some other app that can perform OCR on PDF files.
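
For the "find" half of the problem, here's a minimal sketch of one way to flag PDFs that have no text layer—this isn't the exact tool set from the post, and it assumes poppler's pdftotext (installable via Homebrew) and a hypothetical Scans folder:

#!/bin/bash
# Flag PDFs with no extractable text; these are the ones that need OCR.
# Assumes: brew install poppler
find "$HOME/Documents/Scans" -type f -name '*.pdf' | while IFS= read -r f; do
  # If pdftotext produces nothing but whitespace, the PDF is image-only
  # and won't be searchable via Spotlight.
  if [ -z "$(pdftotext "$f" - 2>/dev/null | tr -d '[:space:]')" ]; then
    echo "NO TEXT: $f"
  fi
done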


View app-specific log messages in Terminal

March 29 2018 Update:

When this tip was first posted, it didn't work right: the log command ignored the --start, --end, and --last parameters. Regardless of what you specified, you'd always get the entire contents of the log. I'm happy to note that this has been resolved in macOS 10.13.4, as log now functions as expected:

$ log show --last 20s --predicate 'processImagePath CONTAINS[c] "Twitter"'
Filtering the log data using "processImagePath CONTAINS[c] "Twitter""
Skipping info and debug messages, pass --info and/or --debug to include.
Timestamp                       Thread     Type        Activity             PID    TTL  
2018-03-30 09:26:15.357714-0700 0xc88a8    Default     0x0                  5075   0    Twitterrific: (CFNetwork) Task <9AD0920A-7AE7-4313-A727-6D34F4BBE38F>.<250> now using Connection 142
2018-03-30 09:26:15.357742-0700 0xc8d7a    Default     0x0                  5075   0    Twitterrific: (CFNetwork) Task <9AD0920A-7AE7-4313-A727-6D34F4BBE38F>.<250> sent request, body N
2018-03-30 09:26:15.420242-0700 0xc88a8    Default     0x0                  5075   0    Twitterrific: (CFNetwork) Task <9AD0920A-7AE7-4313-A727-6D34F4BBE38F>.<250> received response, status 200 content K
2018-03-30 09:26:15.420406-0700 0xc8d7a    Default     0x0                  5075   0    Twitterrific: (CFNetwork) Task <9AD0920A-7AE7-4313-A727-6D34F4BBE38F>.<250> response ended
Log      - Default:          4, Info:                0, Debug:             0, Error:          0, Fault:          0
Activity - Create:           0, Transition:          0, Actions:           0

This makes it really easy to get just the time slice you need from the otherwise overwhelming logs. You can use s for seconds, m for minutes, h for hours, and d for days as arguments to these parameters.

This article provides a nice overview on interacting with log and predicates to filter the output—there’s a lot you can do to help figure out what might be causing a problem.
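
For instance—these are illustrative predicates of my own, not from the original post—you can filter on the process name rather than its path, or watch new messages arrive live:

$ log show --last 5m --info --predicate 'process == "WindowServer"'
$ log stream --predicate 'processImagePath CONTAINS[c] "Safari"'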

And now, here’s the rest of the original post…


Revisiting a PDF page counting script

A couple of years back, I created a bash script to count PDF pages across subfolders—I typically run it on folders such as my collection of Apple manuals.

I use this script on the top-level folder where I save all my Fujitsu ScanSnap iX500 scans. Why? Partly because I’m a geek, and partly because it helps me identify folders I might not need to keep on their own—if there are only a few pages in a folder, I’ll generally try to consolidate its contents into another lightly-used folder.

The script I originally wrote worked fine, and still works fine—sort of. When I originally wrote about it, I said…

I feared this would be incredibly slow, but it only took about 40 seconds to traverse a folder structure with about a gigabyte of PDFs in about 1,500 files spread across 160 subfolders, totaling 5,306 PDF pages.

That was then, this is now: With 12,173 pages of PDFs spread across 4,475 files in 295 folders, the script takes over two minutes to run—155 seconds, to be precise. That’s not anywhere near acceptable, so I set out to see if I could improve my script’s performance.

In the end, I succeeded—though it was more of a “we succeeded” thing, as my friend James (who uses a very similar scan-and-file setup) and I went back-and-forth with changes over a couple days. The new script takes just over 10 seconds to count pages in the same set of files. (It’s even more impressive if the files aren’t so spread out—my eBooks/Manuals folder has over 12,000 pages, too, but in just 139 files in 43 folders…the script runs in just over a second.)

Where'd the speed boost come from? One simple change that seems obvious in hindsight—I was amazed it actually worked…
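
The original post has the details; as a rough sketch of the general idea, batching every file into a single Spotlight metadata query—rather than launching one process per file—is the kind of change that yields this sort of speedup. The folder handling and approach below are illustrative, not the author's actual script:

#!/bin/bash
# Count PDF pages under a folder (default: current directory) using
# Spotlight metadata; one batched mdls call avoids per-file process overhead.
find "${1:-.}" -type f -name '*.pdf' -print0 |
  xargs -0 mdls -name kMDItemNumberOfPages |
  awk '{ sum += $3 } END { print sum, "total PDF pages" }'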


See how long an app has been running

For a recent customer support question, I needed to know how long our app Witch had been running. There are probably many ways to find this out, but I couldn't think of one. A quick web search found the solution, via ps and its etime output field.

You need the process ID (pid), which you can find via ps ax | grep [a]ppname. (The square brackets around the first letter are there so grep won't find itself—and thus list itself—in the output.) In my case, Witch runs a background task called witchdaemon, so I did it this way…

$ ps -ax | grep [w]itchd
  774 ??        26:40.73 /Users/robg/Library/PreferencePanes...[trimmed]

With the pid, the command to find that process’ uptime is:

$ ps -o etime= -p "774"

The elapsed time readout is in the form of dd-hh:mm:ss, so Witch had been running for 11 days and a few hours and minutes. Note that you can combine these steps, getting the process ID and using it in the ps command all at once:

ps -o etime= -p "$(ps -ax | grep [a]ppname | awk '{print $1}')"

It's messy looking, but this form saves time and typing. (awk works better than cut here, because ps right-aligns pids with leading spaces, which would leave cut grabbing an empty first field.)
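
If you do this often, you could wrap it all in a little shell function—a hypothetical convenience using pgrep, which sidesteps the grep-matching-itself trick entirely:

# Show how long a process has been running, by name.
uptime_for() {
  local pid
  pid=$(pgrep -x "$1" | head -n 1)   # -x requires an exact name match
  [ -n "$pid" ] && ps -o etime= -p "$pid"
}
uptime_for witchdaemon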

Selective pruning of old rsync backups

In yesterday’s post, I described a couple rsync oddities, and how they’d led me to this modified command for pruning old (older than four days) backups:

find /path/to/backups/ -mindepth 1 -maxdepth 1 -type d -Bmin +$((60*24*4)) -exec rm -r {} +

After getting this working, though, I wondered if it’d be possible to keep my backups from the first day of each month, even while clearing out the other dates. After some digging in the rsync man page, and testing in Terminal, it appears it’s possible, with some help from regex.

My backup folders are named with a trailing date and time stamp—back-2017-05-01_2200, for example.


To keep any backups made on the first of any month, for my folder naming schema, the modified find command would look like this:

find /path/to/backups/ -mindepth 1 -maxdepth 1 -type d -Bmin +$((60*24*4)) -not -regex ".*-01_.*" -exec rm -r {} +

The new bits, -not -regex ".*-01_.*", basically say "match only items whose names do not contain the string hyphen-01-underscore, regardless of what surrounds it." And because only backups made on the first of the month contain that pattern, they're the only ones left out of the purge.
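
Before letting any find…-exec rm -r command loose on real backups, it's worth a dry run: swap the -exec clause for -print and confirm the list contains only what you expect. (Paths are the same placeholders as above.)

find /path/to/backups/ -mindepth 1 -maxdepth 1 -type d -Bmin +$((60*24*4)) -not -regex ".*-01_.*" -print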

This may be of interest to maybe two people out there; I’m documenting it so I remember how it works!

How to install ruby gems in Terminal

In yesterday’s tip, See sensor stats in Terminal, I implied that installation of the iStats ruby gem was a simple one-line command. As a commenter pointed out, that’s only true if you already have the prerequisites installed. The prerequisites in this case are the Xcode command line tools. Thankfully, you can install those without installing the full 5GB Xcode development environment.

(Rather than starting from scratch, I’m just going to borrow this bit from my detailed instructions for installing the transcode-video tools, because the Xcode command line tools are required there, too.)

Here’s how to install the command line tools. Open Terminal, paste the following line, and press Return:

xcode-select --install

When you hit Return, you’ll see a single line in response to your command:

$ xcode-select --install
xcode-select: note: install requested for command line developer tools

At this point, macOS will pop up a dialog, which is somewhat surprising as you're working in the decidedly non-GUI Terminal.

Do not click Get Xcode, unless you want to wait while 5GB of data downloads and installs on your Mac. Instead, click the Install button, which will display an onscreen license agreement. Click Agree, then let the install finish—it’ll only take a couple of minutes.

If you're curious as to what just happened, the installer created a folder structure in the top-level Library folder (/Library/Developer/CommandLineTools), and installed a slew of programs in the usr directory within that folder.
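
To confirm everything landed where expected, a couple of quick checks—assuming you haven't also installed the full Xcode, xcode-select should report the command line tools' directory:

$ xcode-select -p
/Library/Developer/CommandLineTools
$ ls /Library/Developer/CommandLineTools/usr/bin   # clang, git, make, etc.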


See sensor stats in Terminal

Someone—perhaps it was Kirk—pointed me at this nifty Ruby gem to read and display your Mac’s sensors in Terminal: iStats — not to be confused with iStat Menus, a GUI tool that does similar things.

Installation is simple, via sudo gem install iStats. After a few minutes, iStats will be ready to use. In its simplest form, call istats by itself with no parameters. Normally I'd list the Terminal output here, but istats presents its information with neat little inline bar graphs (that's the default; it can be disabled), so it's best seen in a screenshot.

This tool is especially useful on a laptop, as it provides an easy-to-read battery summary.
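
Beyond the bare istats call, the gem supports a handful of subcommands. These examples follow the iStats project's documented usage, so treat them as a sketch:

$ istats cpu temp         # just the CPU temperature
$ istats fan speed        # fan speeds
$ istats battery health   # battery condition
$ istats scan             # find extra sensors, then view them via: istats extra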


Easy Unix date formatting

I use the date command quite often in scripts, mainly to append date/time stamps to filenames. For example, something like this…

newtime=`date +%Y-%m-%d_%H%M`
cp somefile $newtime-some_other_file

That particular format is the one I use most often, with the full date followed by the hours and minutes in 24 hour format: 2017-04-12_2315, for example. I use this one so that filenames wind up sorted by date order in Finder views.
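
A few other formats I find handy, as illustrations—the specifiers are standard strftime codes, so the stock date accepts them all:

$ date +%Y-%m-%d_%H%M          # 2017-04-12_2315, the sortable stamp above
$ date +"%A, %B %e, %Y"        # Wednesday, April 12, 2017
$ date -u +%Y%m%dT%H%M%SZ      # 20170413T061500Z, a UTC stamp for logs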

Once I move beyond that format, though, the vagaries of date string formatting leave me dazed. Enter a web-based format builder, where you can build any date string you like using a point-and-click editor with real-time previews.

It doesn’t get much easier than that.

Create Time Machine-like backups via rsync

Taking a break from the recent Frankenmac posts, here's a little trick for creating Time Machine-like backups of anything you'd care to back up. (I don't know how well this might work for Mac files, as opposed to Unix files—but Mac files can be saved to the real Time Machine.) In my case, it's the HTML files from my web sites, both personal and work. I used to simply back these up, totally overwriting the backup each day; then I realized it'd be better to keep versions.

Once you've got it set up and working, you'll have one folder for each backup, plus a "current" symbolic link that always points at the newest backup.

I get zero credit for this one; my buddy James explained that he’d been using this method for a year without any troubles, and pointed me to this great guide that explains the process.

I used that guide and added the following to my backup script to create my own customized Time Machine for this site's files:

newtime=`date +%Y-%m-%d_%H%M`   # date stamp for this backup's folder name
# --link-dest hard-links files unchanged since the previous ("current")
# backup, so each run stores only what actually changed
/usr/local/bin/rsync -aP \
  --link-dest=/path/to/quasi/TM_backup/current \
  --exclude "errors.csv" \
  --delete --delete-excluded \
  user@host:/path/to/files/on/server/ \
  /path/to/quasi/TM_backup/back-$newtime
# repoint "current" at the backup we just made
rm -f /path/to/quasi/TM_backup/current
ln -s /path/to/quasi/TM_backup/back-$newtime /path/to/quasi/TM_backup/current

And that’s all there is to it. Note that you may need a newer version of rsync than what comes with macOS now (2.6.9)—I use version 3.1.2 from Homebrew, so I can’t say for sure that this script works with the stock version.

I’ve only been using this for a couple weeks, but it’s working well for me so far.
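
If you want to verify the hard-linking is working, compare inode numbers for an unchanged file across two backups—identical numbers in the first column of ls -li mean both backups share a single copy on disk. (The file name here is hypothetical.)

$ ls -li /path/to/quasi/TM_backup/back-*/index.html
$ du -sh /path/to/quasi/TM_backup/back-*   # later backups should be tiny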

Adjusting for the oddities of ctime

In the shell script I use to back up my web sites (I really should update that post—my sites are much different now), I include a line that trims the backup folder of older compressed backups of the actual WordPress databases. That line used to look like this:

find path/to/sqlfiles/backups -ctime +5 -delete

I thought this should delete all backups in that folder that are at least five days old, via the ctime +5 bit. (I know now that I should have been using mtime, though it would have had the same issue I had with ctime.) But it turns out I thought wrong: the above will delete all files that are at least six days old. Why? I don't know, but it's mostly explained in the man page for find (my emphasis added):

-ctime n[smhdw] If no units are specified, this primary evaluates to true if the difference between the time of last change of file status information and the time find was started, *rounded up to the next full 24-hour period*, is n 24-hour periods. If units are specified, this primary evaluates to true if the difference between the time of last change of file status information and the time find was started is exactly n units. Please refer to the -atime primary description for information on supported time units.

To make find do what I wanted it to do, I just needed to change +5 to +5d. Simple enough…but while figuring this out, I stumbled across this page, which has an alternative solution with more flexibility:

find path/to/sqlfiles/backups -mmin +$((60*24*5)) -delete

The mmin parameter is much more precise than ctime:

-mmin n True if the difference between the file last modification time and the time find was started, rounded up to the next full minute, is n minutes.

By using mmin, I can be really precise. As shown, 60*24*5 gets me the same five-day interval as ctime +5d. (And yes, I could have used 7200 instead of 60*24*5, but I find it clearer to leave it in its expanded form.)

But I could instead delete backups that were older than 3.25 days (60*24*3.25, or 4680 minutes), or for any other arbitrary time period. I like the flexibility this offers over ctime, so I've switched my script over to this form.
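
One caveat if you script a fractional interval: the shell's $(( )) arithmetic is integer-only, so $((60*24*3.25)) is a syntax error. Precompute the minutes, or let bc do the math—a quick sketch:

mins=$(echo "60*24*3.25" | bc)               # 4680.00
find path/to/sqlfiles/backups -mmin +"${mins%.*}" -delete   # strip the .00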
