Ottawa PC Users' Group, Inc.
 Product Review 


On-Screen OCR
by Alan German

Some people seem to delight in making life hard. For example, there are those who protect PDF files so that, in theory, one can’t copy the text. Of course, there are workarounds for almost every such eventuality. In this case, one can simply print a page of the file, scan it, and run it through optical character recognition (OCR) software. Recently, a colleague asked if there was a way to shortcut this process by using OCR software directly from the computer screen.

A little research on the Internet turned up several candidate programs, including one in my particular favourite category of free and open-source software. But, how well do such programs perform? The acid test was running Capture2Text – an open-source program from SourceForge – on a page from Chris Taylor’s article on disk image backups in the March issue of OPCUG’s newsletter.

Now, this wasn’t because Chris is intent on stopping people from reproducing his ideas, nor is the newsletter’s PDF file protected from copying. It was just that our newsletter provided a readily available and relatively simple page layout, with both text and graphics, for me to use as a test case.

Once the OCR utility has been downloaded as a ZIP file, the embedded files can easily be extracted to a Capture2Text folder. The software doesn’t need to be installed, and running the executable (Capture2Text.exe) establishes the program in a terminate-and-stay-resident mode. Right-clicking on the program’s icon in the task bar provides a number of options. Checking “Show Popup Window” is recommended as this will subsequently provide direct confirmation of the text capture process (see below).

I found the program’s operation to be both non-intuitive and somewhat sensitive to mouse clicks. Checking out the instructions in the readme.txt file is highly recommended. You will find that the critical keyboard shortcut is Windows-Key-Q, which both starts the capture process and ends it. However, the subsequent part of the instructions isn’t at all clear. It indicates: “Now, using your mouse, resize the capture box over the area of the screen that you want to OCR.” The obvious choice would be to use the left mouse button to drag some sort of frame over the text to be selected. However, this is doomed to failure since, by default, a left-mouse click actually ends the capture process!

In fact, what you seem to need to do is press Windows-Key-Q, wait a few seconds, and then move the right-mouse button. A blue rectangle expands to select an area of text. Holding the right-mouse button down, and dragging the mouse, allows the rectangle to be re-positioned on the screen so that, for example, the top-left corner of a section of text can be selected. Subsequently releasing the right-mouse button, and dragging the mouse across the screen once again expands the rectangle so that a specific block of text can be defined. When the desired text has been highlighted in this manner, pressing Windows-Key-Q a second time causes the text capture process to be completed. The selected text is now displayed in a pop-up window (entitled Capture2Text Popup Result) confirming that the desired text has been captured.

I found this mouse-key/select-drag sequence to be very non-intuitive. I was forever pressing the left-mouse button only to find that nothing further happened – because, of course, I had actually aborted the text-capture process!

However, after a little trial and error, I successfully saw the text that I had selected displayed in the pop-up box. The same text had also been copied to the Windows clipboard and so could readily be pasted into a word-processing file. The following figure shows the results for one section of a page from OPCUG’s newsletter.

A number of points are worth noting:

1. The basic text is generally reproduced quite well by the OCR process.

2. The OCR process does not capture new paragraphs. When pasted from the clipboard into a word processor, the text appears as a single, continuous stream. For example, the text from the first and second paragraphs is run together as “…just under 25GB. There are lots…”

3. There are some errors in transcription, e.g. “?ill” should have been “full” and “afier” should have been “after”. [However, sometimes the OCR process is deadly accurate – too accurate perhaps. The displayed “co mpression” and “d isk” (direct from my Foxit Reader software) were both faithfully reproduced in the OCR’d text!]

4. The program has an editable suggestions.txt file that can be used to “train” the OCR to recognize (or at least replace) mistakes such as those noted above.

5. If too many errors are noted, it may be worthwhile changing the on-screen zoom level for the original PDF file, to show larger text, and then re-running the OCR process.

6. The occurrence of a block of nonsensical text (e.g. _, .-am-.,.._..,..,.~,.,.t.,,. Q [,_.,,L,,,-,) clearly indicates the location of the graphic on the page, so it is easy to identify and remove this material. [However, note that the OCR software does a creditable job of identifying snippets of text contained in the image (e.g. Disk/Partition backup)!]

7. Running a spell checker on the word-processing file is quite successful in identifying specific problems in the transcribed text (e.g. “co mpression”).

8. Italic text didn’t faze the OCR software; it correctly identified “Todo Backup has a Clone operation…”
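Points 3 and 4 suggest that a simple clean-up pass over the pasted text could catch the recurring misreadings. The following is a minimal sketch in Python of that idea, applied after the fact to text taken from the clipboard. It is not a description of how Capture2Text's own suggestions.txt file works (its format isn't covered in this review); the CORRECTIONS table and the fix_ocr_text function are hypothetical names introduced purely for illustration, and the entries are just the examples noted above.

```python
import re

# Hypothetical corrections table (an assumption for this sketch):
# the misreadings and split words reported in point 3.
CORRECTIONS = {
    "?ill": "full",
    "afier": "after",
    "co mpression": "compression",
    "d isk": "disk",
}


def fix_ocr_text(text, corrections=CORRECTIONS):
    """Replace known OCR misreadings, matching whole words only."""
    for wrong, right in corrections.items():
        # The lookarounds stop a match from starting or ending
        # in the middle of a longer word.
        pattern = r"(?<!\w)" + re.escape(wrong) + r"(?!\w)"
        text = re.sub(pattern, right, text)
    return text


print(fix_ocr_text("a ?ill backup afier co mpression of the d isk"))
# prints: a full backup after compression of the disk
```

A table like this only fixes mistakes that have already been seen, so running a spell checker afterwards (point 7) would still be worthwhile.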

When testing the software on another PDF file, it was found that this free software did a much better job on tabular information than a number of commercial applications. While the data elements from a sample table were captured as a single “paragraph” on the clipboard, the individual numeric values were clearly delimited by spaces and held the possibility of being transferred fairly easily into individual columns in a spreadsheet. Since this is my colleague’s specific application, the program shows considerable promise.

But, even better, when backing up the recently-changed files on my hard drive, I discovered that the \Capture2Text\Output folder contained a file named ocr.txt that held these tabular values – in a simple ASCII format – in individual rows! While this is clearly an interim file created by the program as part of the OCR process, if not overwritten by another screen capture, it is available for use outside of the program. And, the data format should make transferring the tabular values to a spreadsheet really simple.
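As a rough illustration of how such a transfer might go, here is a minimal Python sketch that turns rows of space-delimited values into comma-separated text that a spreadsheet can open. The rows_to_csv function and the sample numbers are hypothetical, and the sketch assumes that every value in a row is separated by whitespace and that no value itself contains a space.

```python
import csv
import io


def rows_to_csv(ocr_text):
    """Turn whitespace-delimited rows of OCR output into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in ocr_text.splitlines():
        fields = line.split()  # split on any run of spaces or tabs
        if fields:             # skip blank lines
            writer.writerow(fields)
    return out.getvalue()


# Made-up sample data standing in for the contents of ocr.txt
sample = "12.5 30 7\n\n8.1 22 4\n"
print(rows_to_csv(sample), end="")
# prints:
# 12.5,30,7
# 8.1,22,4
```

To use this on a real capture, the contents of ocr.txt would be read into a string first and the result written to a .csv file. Text labels made up of more than one word would be split across columns, so a table with such labels would need some manual tidying.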

Capture2Text does a pretty good job at capturing on-screen text, converting it to machine-readable form through an OCR process, and allowing the results to be saved to a file. If you have such a need, give this program a try. You may not be disappointed and, given the price, you certainly won’t be out of pocket!


Bottom Line:

Capture2Text (Open source)
Christopher Brochtrup
Version 3.1
http://capture2text.sourceforge.net/



Copyright and Usage
Ottawa Personal Computer Users' Group (OPCUG), Inc.
3 Thatcher Street, Ottawa, ON  K2G 1S6

The opinions expressed in these reviews do not necessarily
represent the views of the OPCUG or its members.
