by Jay Velgos
"Never Retype Anything Again!" the ads promise us. "Scan Whole Filing Cabinets Directly into Microsoft Word!" we see in big, enthusiastic type, as if it were a "Lose Weight Now!" ad in TV Guide. This high-tech miracle on the market in these ads is called OCR, for Optical Character Recognition. The technology's not all that new, but the masses are now discovering it, often because one version or another has been bundled with their $100 scanner from the neighborhood "Computers R Us."
Can these claims possibly be true? Can OCR technology actually take stacks of manuscripts and magically reduce them to a convenient floppy disk? Surprisingly, the answer is yes, but with about a magazine article's worth of caveats.
So let's get started.
First off, it's important to make a distinction between a document on a computer screen and a picture of a document on a computer screen. In my work, I've encountered a lot of very smart people who don't immediately grasp the fundamental difference, so let's take a moment to get everybody on the same page.
When an actual document is on your screen, you can add, delete, or change the words. You can make them bigger or smaller, and you can copy words and move them somewhere else. This is possible because all the letters of all the words - as well as all the formatting - exist in the file behind the scenes, like the gears and springs hidden behind the face of a clock.
Now, if you pulled a page from your stack of laser-printed resumes and put it face down on your $99 scanner, within a few seconds you could be reading those job highlights right on your computer screen. You could even print it out. But trying to edit the text would be like attempting to wind up a picture of an alarm clock. The behind-the-scenes part of the file doesn't contain words, just instructions about the dots that make the image: where they go and what color they are. Try adding, deleting, or changing words on the screen, and you'll soon discover the futility. Like painting Wite-Out on your monitor, it just doesn't work.
"But wait," you say, "I can read the words on either one!" Well of course you can You've been to school, educated, and conditioned to recognize the funny little symbols as letters, numbers, words, and ideas. In fact, even when some of the little symbols are missing or written even more funny than usual, you can still figure out what it all means. That's because you're smart. And you've been taught well.
Computers can be taught too, and it's OCR software that teaches computers how to recognize letters, which - although it is not the same as recognizing words or ideas - is all that's necessary to convert a picture of a document into an actual "editable" or "text" document.
OCR applications range in cost from more than $600 all the way down to, well, free. (Keep in mind that the free ones are often limited versions of the actual applications, carefully designed to entice you into purchasing the full retail version.) Despite the cost disparity, though, they all do essentially the same thing: analyze the patterns on the digital image (the "picture") of the page and try to match each one to a known letter, number, or symbol. It's sort of like a kindergartner pointing at headlines: "There's an A . . . there's a . . . G . . ." The computer does it much faster, and instead of saying the letter out loud (or asking for more juice), it adds that letter to a text file. By the time the computer is done with the image, it's created an editable text file that you can import into your favorite word processor program and start reformatting, printing, and sending to your friends, family, and co-workers.
Oh, there is one step I left out. It's often only mentioned briefly in the OCR instructions. You'll almost certainly have to edit the text file, correct the mistakes, and fill in the parts where the computer was totally stumped. You will probably see a lot of tildes (~), escape arrows (^), and other marks that the computer uses to indicate "Huh?"
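For the curious, the match-each-pattern-to-a-letter idea, along with the "Huh?" marker, can be sketched in a few lines of Python. This is only a toy under stated assumptions: the 3-by-5 dot patterns, the names `TEMPLATES`, `similarity`, `recognize`, and `recognize_page`, and the 90 percent cutoff are all made up for illustration, and commercial OCR packages use far more sophisticated methods.

```python
# Toy letter matcher. Each "glyph" is a 3x5 grid of dots ("1" = ink, "0" = paper).
# The templates and the 0.9 cutoff are illustrative assumptions, not taken
# from any real OCR product.
TEMPLATES = {
    "A": ["010", "101", "111", "101", "101"],
    "C": ["011", "100", "100", "100", "011"],
    "G": ["011", "100", "101", "101", "011"],
    "T": ["111", "010", "010", "010", "010"],
}


def similarity(glyph, template):
    """Return the fraction of dots that agree between two same-sized grids."""
    pairs = [
        (g, t)
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    ]
    return sum(g == t for g, t in pairs) / len(pairs)


def recognize(glyph, threshold=0.9):
    """Return the best-matching letter, or "~" when nothing is close enough."""
    best_letter, best_score = max(
        ((letter, similarity(glyph, tpl)) for letter, tpl in TEMPLATES.items()),
        key=lambda pair: pair[1],
    )
    return best_letter if best_score >= threshold else "~"


def recognize_page(glyphs):
    """Recognize each glyph in order and join the results into a text string."""
    return "".join(recognize(g) for g in glyphs)
```

Fed clean patterns for C, A, and T, `recognize_page` returns the text "CAT". Feed it a solid blob of ink where the A should be, and no template comes close enough, so that position comes back as a tilde for you to fix by hand.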
The number of "Huh?" marks in your text file will vary, based on several criteria including: 1) how much you paid for the OCR software; and 2) how quickly you need to get the job finished. Other important criteria include the size and style of the typeface. (For instance, out of the 200 fonts I have on my computer, OCR software would probably recognize 12-point Times Roman more accurately than, say, 6-point Haettenschweiler.)
Let's look at the commonly seen claim of "Up to 99% Accuracy!" For one thing, that's 99 percent of letters, not words, so mathematically we're looking at about one mistake for every 20 five-letter words. The "Up to" part is, of course, ad-speak for "we're referring to the best-case scenario, but your mileage may vary."
Imagine an 85 percent accuracy rate, which is a more realistic number to expect, especially if you've scanned anything besides crisp, dark, laser-printed pages with standard fonts and margins. At 15 mistakes per 100 characters, you could end up with 15 out of every 20 words having mistakes in them, mistakes you'll have to correct manually. At best, the errors would be concentrated in three massively mangled words out of every 20.
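If you'd like to check that arithmetic yourself, here is a minimal Python sketch. The function names `damaged_word_range` and `clean_word_rate` are hypothetical, and the second function rests on an extra assumption that goes beyond the figures above: that each letter is misread independently of its neighbors, which real scans rarely honor.

```python
def damaged_word_range(accuracy, words=20, letters_per_word=5):
    """Return (best_case, worst_case) counts of words containing errors.

    accuracy is the fraction of letters read correctly, e.g. 0.85.
    """
    total_letters = words * letters_per_word
    errors = round((1 - accuracy) * total_letters)
    # Worst case: every error lands in a different word.
    worst_case = min(errors, words)
    # Best case: errors are packed into as few words as possible
    # (ceiling division by the word length).
    best_case = -(-errors // letters_per_word)
    return best_case, worst_case


def clean_word_rate(accuracy, letters_per_word=5):
    """Chance that a whole word is read correctly.

    Assumes letter errors are independent, which is an illustrative
    simplification rather than a property of real OCR output.
    """
    return accuracy ** letters_per_word
```

Calling `damaged_word_range(0.99)` gives `(1, 1)`: one damaged word in 20. Calling `damaged_word_range(0.85)` gives `(3, 15)`, matching the three-to-fifteen range above. Under the independence assumption, `clean_word_rate(0.85)` comes out to roughly 0.44, meaning fewer than half of your five-letter words would survive untouched.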
Despite everything noted above, I'm a big fan of OCR technology. One reason is that it's an important element behind Project Gutenberg (www.promo.net/pg), which has taken tens of thousands of books (all in the public domain) and converted them to electronic format for free distribution on the Internet. I've used OCR technology both on my job and at home and have found it to be aggravating at times, but still useful and timesaving. And even fun.
The key is picking the right project. Let's say you've got a manuscript that you'd like to put into an electronic format, or maybe make available online. Here are some important questions to ask:
If the objective of your project is to take a bound volume, typewritten manuscript, or other printed resource and make it available over the Internet, you had better make sure that you either own the copyright or that the material is already in the public domain. A related issue is:
Copyright can be difficult to enforce for resources available online. Once your resource is publicly available, unscrupulous entrepreneurs could easily parlay your investment into their own profitable CD-ROM or published book.
This should probably have been the first question. If the manuscript is handwritten, then just stop dreaming. Although there is a flavor of OCR technology for handwriting called Intelligent Character Recognition (ICR), it has the distinction of being both ineffectual and expensive. (And no, despite what you're thinking, it's not a Microsoft product.)
If you - with your professional degree and the appropriate bifocal lenses - can't read the page, odds are about 110 percent that the computer won't be able to read it either. How is the contrast between the ink and the paper? Is it a blue carbon copy or an ancient Ditto machine purple copy? Any of these could spell trouble for the project. A good way to preview the scan-ability of a page is to copy it on your office photocopy machine, since the technology is similar. If the type fades out, you'll probably have a problem.
You won't hear the phrase "It goes without saying" in this article. If you've only got a single copy of the manuscript, don't even think about trusting it to the automatic document feeder (ADF) on your scanner, unless you're ready to learn the true meaning of "fold, spindle, and mutilate." I'm not saying that ADFs are unreliable and should be avoided, just that an ADF is more likely to mangle an original document than a copy of one. This is not Murphy's Law at work; only the fact that an original document is more likely to have dog-eared corners, wrinkles, or pages that aren't lined up perfectly square, all of which are major causes of paper jams.
If your document is very old, you may wish to first discuss the scanning with a conservator. Although most scanners expose pages to less light than a copy machine, certain documents will require extra special care.
In addition to the copyright concerns, scanning books introduces yet another obstacle: the binding. If you have to bend the book to place a page squarely on the scanner bed, will that ruin the binding? Do you care? If you don't care, you can bend all you want. And if you really don't care about the binding, you might want to consider having it professionally removed from the book entirely, which would ensure quicker scanning, flatter and less distorted images, and more accurate OCR. Keep in mind that this option effectively destroys the book, but if you've got several copies, it may be a reasonable sacrifice to make.
Or rather: are there copy edit marks all over the pages? If the answer is yes, do whatever is necessary to find an alternate, unmarked copy. You could have the clearest text and pristine character shapes on your scanned image, but if there are squiggles, underlines, circles, arrows, or any other marks that even partially cover the actual text, your OCR application will not just say "Huh?" - it will start desperately grasping for meaning, sort of like Keanu Reeves in that movie where he played Buddha. It's not pretty.
With that disturbing image in mind, I'll leave you with some suggestions on where to learn more.
Good Luck!
Jay Velgos works in the Library Resource Sharing Division at the Texas State Library and Archives Commission.