On-Screen OCR
by Alan German
Some people
seem to delight in making life hard. For example, there
are those who protect PDF files so that, in theory, one
can't copy the text. Of course, there are
workarounds for almost every such eventuality. In this
case, one can simply print a page of the file, scan it,
and run it through optical character recognition (OCR)
software. Recently, a colleague asked if there was a way
to shortcut this process by using OCR software directly
from the computer screen.
A little research on the Internet turned up several
candidate programs, including one in my particular
favourite category of free and open-source software. But,
how well do such programs perform? The acid test was
running Capture2Text, an open-source program from
SourceForge, on a page from Chris Taylor's
article on disk image backups in the March issue of
OPCUG's newsletter.
Now, this wasn't because Chris is intent on stopping
people from reproducing his ideas, nor is the
newsletter's PDF file protected from copying. It was
just that our newsletter provided a readily available,
and relatively simple, page layout, with both text and
graphics, for me to use as a test case.
Once the OCR utility has been downloaded as a ZIP file, the
embedded files can easily be extracted to a Capture2Text
folder. The software doesn't need to be installed,
and running the executable (Capture2Text.exe) establishes
the program in a terminate-and-stay-resident mode.
Right-clicking on the program's icon in the task bar
provides a number of options. Checking "Show Popup
Window" is recommended as this will subsequently
provide direct confirmation of the text capture process
(see below).
I found the program's operation to be both
non-intuitive and somewhat sensitive to mouse clicks.
Checking out the instructions in the readme.txt file is
highly recommended. You will find that the critical
keyboard shortcut is Windows-Key-Q, which both starts the
capture process and ends it. However, the subsequent
part of the instructions isn't at all clear. It
indicates: "Now, using your mouse, resize the capture
box over the area of the screen that you want to
OCR." The obvious choice would be to use the left
mouse key to drag some sort of frame over the text to be
selected. However, this is doomed to failure since, by
default, a left-mouse click actually ends the capture
process!
In fact, what you seem to need to do is press
Windows-Key-Q, wait a few seconds, and then move the
mouse. A blue rectangle expands to select an
area of text. Holding the right-mouse button down, and
dragging the mouse, allows the rectangle to be
re-positioned on the screen so that, for example, the
top-left corner of a section of text can be selected.
Subsequently releasing the right-mouse button, and
dragging the mouse across the screen once again expands
the rectangle so that a specific block of text can be
defined. When the desired text has been highlighted in
this manner, pressing Windows-Key-Q a second time causes
the text capture process to be completed. The selected
text is now displayed in a pop-up window (entitled
"Capture2Text Popup Result") confirming that the desired
text has been captured.
I found this mouse-key/select-drag sequence to be very
non-intuitive. I was forever pressing the left-mouse key
only to find that nothing further happened because,
of course, I had actually aborted the
text-capture process!
However, after a little trial and error, I successfully
saw the text that I had selected displayed in the pop-up
box. The same text had also been copied to the Windows
clipboard and so could readily be pasted into a
word-processing file. The following figure shows the
results for one section of a page from OPCUG's
newsletter.
A number of
points are worth noting:
1. The basic text is generally reproduced quite well by
the OCR process.
2. The OCR process does not capture new paragraphs. When
pasted from the clipboard into a word processor, the text
appears as a single, continuous stream. For example, the
text from the first and second paragraphs is run together
as "just under 25GB. There are lots".
3. There are some errors in transcription, e.g.
"?ill" should have been "full" and
"afier" should have been "after".
[However, sometimes the OCR process is deadly accurate,
too accurate perhaps. The display of "co
mpression" and "d isk" (direct from my
Foxit Reader software) were both faithfully reproduced in
the OCR'd text!]
4. The program has an editable suggestions.txt file that
can be used to "train" the OCR to recognize (or
at least replace) mistakes such as those noted above.
5. If too many errors are noted, it may be worthwhile
changing the on-screen zoom level for the original PDF
file, to show larger text, and then re-running the OCR
process.
6. The occurrence of a block of nonsensical text (e.g. _,
».-am-.,.._..,..,.~,.,.t.,,. Q [,_.,,L,,,-,») clearly
indicates the location of the graphic on the page so it
is easy to identify and remove this material. [However,
note that the OCR software does a creditable job of
identifying snippets of text contained in the image (e.g.
"Disk/Partition backup")!]
7. Running a spell checker on the word-processing file is
quite successful in identifying specific problems in the
transcribed text (e.g. "co mpression").
8. Italic text didn't faze the OCR software; it
correctly identified "Todo Backup has a Clone
operation".
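The recurring errors noted above lend themselves to a simple clean-up pass once the text has been pasted from the clipboard. The following is a minimal Python sketch of that idea, not a description of how Capture2Text or its suggestions.txt file works. The correction table and the function name fix_ocr_text are hypothetical, made up for illustration from the errors seen in this test.

```python
import re

# Hypothetical correction table, built from the errors seen in this test.
# This is NOT the format of Capture2Text's suggestions.txt file.
CORRECTIONS = {
    "afier": "after",
    "co mpression": "compression",
    "d isk": "disk",
}


def fix_ocr_text(text, corrections=CORRECTIONS):
    """Replace each known OCR error with its correction, whole words only."""
    for wrong, right in corrections.items():
        # \b keeps "d isk" from matching inside a phrase such as "and isk".
        pattern = r"\b" + re.escape(wrong) + r"\b"
        text = re.sub(pattern, right, text)
    return text
```

A spell checker will still be needed for errors that are not in the table, but a pass like this could take care of the mistakes that show up again and again.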
When testing the software on another PDF file, it was
found that this free software did a much better job on
tabular information than a number of commercial
applications. While the data elements from a sample table
were captured as a single paragraph on the
clipboard, the individual numeric values were clearly
delimited by spaces and held the possibility of being
transferred fairly easily into individual columns in a
spreadsheet. Since this is my colleague's specific
application, the program shows considerable promise.
But, even better, when backing up the recently-changed
files on my hard drive, I discovered that the
\Capture2Text\Output folder contained a file named
ocr.txt that held these tabular values in a simple ASCII
format in individual rows! While this is clearly an
interim file
created by the program as part of the OCR process, if not
overwritten by another screen capture, it is available
for use outside of the program. And, the data format
should make transferring the tabular values to a
spreadsheet really simple.
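As a rough illustration of that last step, the following Python sketch turns rows of space-delimited values into comma-separated text that a spreadsheet can open. It assumes that each row of the table sits on its own line and that no single value contains a space; the function name rows_to_csv is hypothetical.

```python
import csv
import io


def rows_to_csv(ocr_text):
    """Split each non-empty line on whitespace and return CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in ocr_text.splitlines():
        fields = line.split()  # assumes no value contains a space
        if fields:  # skip blank lines
            writer.writerow(fields)
    return out.getvalue()
```

The contents of ocr.txt could be read with open() and passed to this function, with the result saved under a .csv name. Any stray OCR errors in the numbers would, of course, still need to be checked by eye.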
Capture2Text does a pretty good job at capturing
on-screen text, converting it to machine-readable form
through an OCR process, and allowing the results to be
saved to a file. If you have such a need, give this
program a try. You may not be disappointed and, given the
price, you certainly won't be out of pocket!
Bottom Line:
Capture2Text (Open source)
Version 3.1
Christopher Brochtrup
http://capture2text.sourceforge.net/
Originally published: June, 2014
The opinions expressed in these reviews
do not necessarily represent the views of the
Ottawa PC Users' Group or its members.