Unix

Find and fix non-searchable PDFs

I use a ScanSnap iX500 scanner to scan a lot of paper into PDFs on my iMac. And thanks to the ScanSnap's bundled optical character recognition (OCR), all of those scans are searchable via Spotlight. While the OCR may not be perfect, it's generally more than good enough to find what I'm looking for.

However, I noticed (based on some tests with Spotlight) that I had a number of PDFs that weren't searchable: some electronic statements from credit cards and utility companies, and some older documents that predated my purchase of the ScanSnap.

But I wanted to know how many such PDFs I had, so I could run OCR on all of them via the excellent PDFPen Pro app. (The Fujitsu's software will only perform OCR on documents it scanned.) The question was how to find all such files and then, once found, how best to run them through PDFPen Pro's OCR process.

In the end, I needed to install one set of Unix tools, and then write two small scripts—one shell script and one AppleScript. Of course, you'll also need PDFPen (I don't think Pro is required), or some other app that can perform OCR on PDF files.
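If you just want a quick way to spot the non-searchable files yourself, here's a minimal sketch of the general idea, assuming you've installed pdftotext from the poppler tools (brew install poppler). The scan folder path is just an example, and this isn't necessarily the approach from the full post:

#!/bin/bash
# Sketch: list PDFs that have no extractable text layer (i.e. no OCR text).
# Assumes pdftotext from poppler: brew install poppler
find "$HOME/Documents/Scans" -type f -name '*.pdf' -print0 |
while IFS= read -r -d '' pdf; do
  # A searchable PDF yields at least some non-whitespace text.
  if [ -z "$(pdftotext "$pdf" - 2>/dev/null | tr -d '[:space:]')" ]; then
    echo "No text layer: $pdf"
  fi
done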

[continue reading…]



Revisiting a PDF page counting script

A couple of years back, I created a bash script to count PDF pages across subfolders. Here's how it looks when run on my folder of Apple manuals.

I use this script on the top-level folder where I save all my Fujitsu ScanSnap iX500 scans. Why? Partly because I'm a geek, and partly because it helps me identify folders I might not need to keep on their own—if there are only a few pages in a folder, I'll generally try to consolidate its contents into another lightly-used folder.

The original script worked fine, and it still does—sort of. When I first wrote about it, I said…

I feared this would be incredibly slow, but it only took about 40 seconds to traverse a folder structure with about a gigabyte of PDFs in about 1,500 files spread across 160 subfolders, and totaling 5,306 PDF pages.

That was then, this is now: With 12,173 pages of PDFs spread across 4,475 files in 295 folders, the script takes over two minutes to run—155 seconds, to be precise. That's not anywhere near acceptable, so I set out to see if I could improve my script's performance.

In the end, I succeeded—though it was more of a "we succeeded" thing, as my friend James (who uses a very similar scan-and-file setup) and I went back and forth with changes over a couple of days. The new script takes just over 10 seconds to count pages in the same set of files. (It's even more impressive if the files aren't so spread out—my eBooks/Manuals folder has over 12,000 pages, too, but in just 139 files in 43 folders…the script runs in just over a second.)

Where'd the speed boost come from? One simple change that seems obvious in hindsight, but that I was amazed actually worked…
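The answer's after the jump, but for context, here's a sketch of the slow-but-obvious approach: asking Spotlight for each file's page count via mdls, launching one process per PDF. Spawning thousands of processes is exactly the kind of overhead that adds up; batching the work into far fewer calls is where a script like this has room to improve. (The folder path and details below are illustrative, not the actual script.)

#!/bin/bash
# Slow sketch: one mdls process per PDF adds up fast across thousands of files.
total=0
while IFS= read -r -d '' pdf; do
  pages=$(mdls -name kMDItemNumberOfPages -raw "$pdf" 2>/dev/null)
  # Spotlight reports "(null)" for unindexed files; count only real numbers.
  [[ $pages =~ ^[0-9]+$ ]] && total=$((total + pages))
done < <(find . -type f -name '*.pdf' -print0)
echo "Total pages: $total"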

[continue reading…]



See how long an app has been running

For a recent customer support question, I needed to know how long our app Witch had been running. There are probably many ways to find this out, but I couldn't think of one. A quick web search found the solution, via ps and its etime output option.

You need the process ID (pid), which you can find via ps ax | grep [a]ppname. (The square brackets around the first letter are there so grep won't find itself—and thus list itself in the output.) In my case, Witch runs a background task called witchdaemon, so I did it this way…

$ ps -ax | grep [w]itchd
  774 ??        26:40.73 /Users/robg/Library/PreferencePanes...[trimmed]

With the pid, the command to find that process' uptime is:

$ ps -o etime= -p "774"
11-03:17:12

The elapsed time readout is in the form of dd-hh:mm:ss, so Witch had been running for 11 days and a few hours and minutes. Note that you can combine these steps, getting the process ID and using it in the ps command all at once:

ps -o etime= -p "`ps -ax | grep [a]ppname | awk '{print $1}'`"

It's messy looking, but this form saves time and typing. (awk grabs the pid here because ps right-aligns the pid column with leading spaces, which would trip up a simple cut -d ' '.)

June 2018 Addendum: If you add the lstart flag, you can see the exact start date and time for the process. For example:

$ ps -o lstart= -o etime= -p "16866"
Tue Jun  5 05:49:19 2018     09-06:11:25
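If you do this often, you could wrap the whole dance in a little shell function. This one (the name apptime is mine, not something that ships anywhere) assumes the process name is unique enough for pgrep to return a single pid:

# A convenience wrapper: show start time and elapsed time for a process.
apptime () {
  local pid
  pid=$(pgrep -x "$1") || { echo "no process named $1"; return 1; }
  ps -o lstart= -o etime= -p "$pid"
}
# Usage: apptime witchdaemon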


Selective pruning of old rsync backups

In yesterday's post, I described a couple of rsync oddities, and how they'd led me to this modified command for pruning old backups (those older than four days):

find /path/to/backups/ -d 1 -type d -Bmin +$((60*4*24)) -maxdepth 1 -exec rm -r {} +

After getting this working, though, I wondered if it'd be possible to keep my backups from the first day of each month, even while clearing out the other dates. After some digging in the find man page, and testing in Terminal, it appears it's possible, with some help from regex.

My backup folders are named with a trailing date and time stamp, like this:

back-2017-05-01_2230
back-2017-05-02_0534
back-2017-05-02_1002

To keep any backups made on the first of any month, for my folder naming schema, the modified find command would look like this:

find /path/to/backups/ -d 1 -type d -Bmin +$((60*4*24)) -maxdepth 1 -not -regex ".*-01_.*" -exec rm -r {} +

The new bits, -not -regex ".*-01_.*", basically say "match only items whose names do not contain the string 'hyphen 01 underscore' (with anything before or after it)." And because only backups made on the first of the month contain that pattern, they're the only ones that will be left out of the purge.
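Given that this command ends in rm -r, it's worth a dry run before trusting it: swap the -exec bits for -print and confirm that only the folders you expect to lose show up.

find /path/to/backups/ -d 1 -type d -Bmin +$((60*4*24)) -maxdepth 1 \
  -not -regex ".*-01_.*" -print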

This may be of interest to only two people out there; I'm documenting it so I remember how it works!



How to not accidentally delete all your rsync backups

With my Time Machine-like rsync backups running well, I decided it was time to migrate over the cleanup portion of my old script—namely, the bit that removes older backups. Soon after I added this bit to my new script, though, I had a surprise: All of my backups, save the most recent, vanished.

In investigating why this happened, I stumbled across two rsync/macOS behaviors that I wasn't aware of…and if you're using rsync for backup, they may be of interest to you, too.

[continue reading…]



How to burn an ISO file to a USB stick

I wanted to install Linux on a hard drive in Frankenmac, as Clover is a multi-boot utility—it lets you choose from any OS it sees during power up. (I'll add Windows, too, eventually.) To do this, you need to get Linux onto a USB stick. I've done this in the past, and my vague recollection of the process was: download the ISO, convert it to an image file, write the image file to a USB stick. However, as it'd been a few years, I went searching for references to make sure I had all the commands correct.

I found a lot of pages with a general summary of the process, and only a few with the specific steps. I tried one of those, but my USB stick didn't work. The other specific pages contained the same basic process, so I was stuck, until I found this page, which contained a critical step I was missing: formatting the USB stick before copying the image file.

For future reference, here's the precise process to follow if you want to burn an ISO file onto a USB stick…
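The exact steps are after the jump, but the general shape of the process on macOS looks like the sketch below. The disk number and volume name are placeholders (run diskutil list and triple-check the device before pointing dd at it):

# Convert the ISO to a read-write disk image (hdiutil appends .dmg).
hdiutil convert -format UDRW -o linux.img linux.iso

# Identify the USB stick's device node; /dev/disk2 below is a placeholder.
diskutil list

# The critical step: format the stick first.
diskutil eraseDisk MS-DOS LINUX MBR /dev/disk2

# Unmount (don't eject), then copy the image to the raw device.
diskutil unmountDisk /dev/disk2
sudo dd if=linux.img.dmg of=/dev/rdisk2 bs=1m

# Eject when dd finishes.
diskutil eject /dev/disk2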

[continue reading…]



How to install ruby gems in Terminal

In yesterday's tip, See sensor stats in Terminal, I implied that installation of the iStats ruby gem was a simple one-line command. As a commenter pointed out, that's only true if you already have the prerequisites installed. The prerequisites in this case are the Xcode command line tools. Thankfully, you can install those without installing the full 5GB Xcode development environment.

(Rather than starting from scratch, I'm just going to borrow this bit from my detailed instructions for installing the transcode-video tools, because the Xcode command line tools are required there, too.)

Here's how to install the command line tools. Open Terminal, paste the following line, and press Return:

xcode-select --install

When you hit Return, you'll see a single line in response to your command:

$ xcode-select --install
xcode-select: note: install requested for command line developer tools

At this point, macOS will pop up a dialog, which is somewhat surprising as you're working in the decidedly non-GUI Terminal.

Do not click Get Xcode, unless you want to wait while 5GB of data downloads and installs on your Mac. Instead, click the Install button, which will display an onscreen license agreement. Click Agree, then let the install finish—it'll only take a couple of minutes.

If you're curious as to what just happened, the installer created a folder structure in the top-level Library folder (/Library/Developer/CommandLineTools), and installed a slew of programs in the usr folder within the CommandLineTools folder.
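To confirm the tools are in place before retrying the gem, you can check the active developer directory. Assuming you installed just the command line tools (and not full Xcode), it should print the CommandLineTools path:

$ xcode-select -p
/Library/Developer/CommandLineTools

$ sudo gem install iStats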

[continue reading…]



See sensor stats in Terminal

Someone—perhaps it was Kirk—pointed me at this nifty Ruby gem that reads and displays your Mac's sensors in Terminal: iStats (not to be confused with iStat Menus, a GUI tool that does similar things).

Installation is simple, via sudo gem install iStats. After a few minutes, iStats will be ready to use. In its simplest form, call istats by itself with no parameters. Normally I'd list the Terminal output here, but istats presents its information with neat little inline bar graphs (on by default; they can be disabled), so here's a screenshot.

This tool is especially useful on a laptop, as it provides an easy-to-read battery summary.
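Beyond the no-parameters form, the gem supports more targeted queries. A few examples (the exact commands may vary by version; the gem's README lists them all):

$ istats cpu temp        # just the CPU temperature
$ istats fan speed       # just the fan speed(s)
$ istats battery health  # just the battery summary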

[continue reading…]



Easy Unix date formatting

I use the date command quite often in scripts, mainly to append date/time stamps to filenames. For example, something like this…

newtime=`date +%Y-%m-%d_%H%M`
cp somefile $newtime-some_other_file

That particular format is the one I use most often, with the full date followed by the hours and minutes in 24-hour format: 2017-04-12_2315, for example. I use this one so that filenames wind up sorted in date order in Finder views.
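For reference, here are a few other formats built from the same strftime-style specifiers; the first matches the stamp above, and the sample outputs are what you'd see for that same date and time:

date +%Y-%m-%d_%H%M            # 2017-04-12_2315
date "+%A, %B %d, %Y"          # Wednesday, April 12, 2017
date +%s                       # seconds since the Unix epoch
date -u +%Y-%m-%dT%H:%M:%SZ    # ISO 8601 style timestamp, in UTC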

Once I move beyond that format, though, the vagaries of date string formatting leave me dazed. Enter strftime.net, where you can build any date string you like using a point-and-click editor with real-time previews.

It doesn't get much easier than that.



Create Time Machine-like backups via rsync

Taking a break from the recent Frankenmac posts, here's a little trick for creating "Time Machine-like" backups of anything you'd care to back up. (I don't know how well this might work for Mac files, as opposed to Unix files. But Mac files can be saved to the real Time Machine.) In my case, it's the HTML files from my web sites, both personal and work. I used to simply back these up, but then realized it'd be better to have versions, rather than totally overwriting the backup each day (which is what I had been doing).

Once you've got it set up and working, you'll have a folder structure similar to the one below, with one folder for each backup, and a "current" link that points to the newest backup.
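For example, with the date-stamped naming I use, the backup folder ends up looking something like this:

back-2017-05-01_2230/
back-2017-05-02_0534/
back-2017-05-02_1002/
current -> back-2017-05-02_1002/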

I get zero credit for this one; my buddy James explained that he'd been using this method for a year without any troubles, and pointed me to this great guide that explains the process. (The original site that hosted the guide is gone; I've linked to a copy I found on archive.org. Original URL: https://blog.interlinked.org/tutorials/rsync_time_machine.html)

I used that guide and added the following to my backup script to create my own customized Time Machine for the files from here, robservatory.com:

/usr/local/bin/rsync -aP \
  --link-dest=/path/to/quasi/TM_backup/current \
  --exclude "errors.csv" \
  --delete --delete-excluded \
  user@host:/path/to/files/on/server/ \
  /path/to/quasi/TM_backup/back-$newtime
rm -f /path/to/quasi/TM_backup/current
ln -s /path/to/quasi/TM_backup/back-$newtime /path/to/quasi/TM_backup/current

And that's all there is to it. Note that you may need a newer version of rsync than what comes with macOS now (2.6.9)—I use version 3.1.2 from Homebrew, so I can't say for sure that this script works with the stock version.
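One nice property of the --link-dest approach: unchanged files are hard links into the previous backup rather than fresh copies, so each new backup costs only the space of whatever changed. You can verify this by comparing inode numbers across two backup folders (the filename and inode values below are illustrative); a matching inode means the two names point at the same file on disk.

$ ls -i back-2017-05-01_2230/index.html back-2017-05-02_0534/index.html
1234567 back-2017-05-01_2230/index.html
1234567 back-2017-05-02_0534/index.html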

I've only been using this for a couple weeks, but it's working well for me so far.