
GNU Ocrad

This manual is for GNU Ocrad (version 0.18, 8 May 2009).


GNU Ocrad is an OCR (Optical Character Recognition) program based on a feature extraction method. It reads images in pbm (bitmap), pgm (greyscale) or ppm (color) formats and produces text in byte (8-bit) or UTF-8 formats. The pbm, pgm and ppm formats are collectively known as pnm.

Ocrad includes a layout analyser able to separate the columns or blocks of text normally found on printed pages.


Copyright © 2003, 2004, 2005, 2006, 2007, 2008, 2009 Antonio Diaz Diaz.

This manual is free documentation: you have unlimited permission to copy, distribute and modify it.



1. Character Sets

The character set internally used by ocrad is ISO 10646, also known as UCS (Universal Character Set), which can represent over two thousand million characters (2^31).

As it is impractical to try to recognize one among so many different characters, you can tell ocrad which character sets to recognize. You do this with the `--charset' option.

If the input page contains characters from only one character set, say `ISO-8859-15', you can use the default `byte' output format. But in a page with `ISO-8859-9' and `ISO-8859-15' characters, you can't tell if a code of 0xFD represents a 'latin small letter i dotless' or a 'latin small letter y with acute'. You should use `--format=utf8' instead.
Of course, you may request UTF-8 output in any case.
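
The ambiguity is easy to check outside ocrad. The following Python sketch is not part of ocrad; it only uses Python's standard codecs to decode the same byte under both charsets, and then shows that the UTF-8 encodings of the two characters differ.

```python
# The byte 0xFD stands for different characters in the two 8-bit charsets.
raw = b"\xfd"

as_latin5 = raw.decode("iso8859_9")    # ISO-8859-9: latin small letter dotless i
as_latin9 = raw.decode("iso8859_15")   # ISO-8859-15: latin small letter y with acute
print(repr(as_latin5), repr(as_latin9))

# In UTF-8 each character has its own unambiguous byte sequence.
utf8_dotless_i = as_latin5.encode("utf-8")
utf8_y_acute = as_latin9.encode("utf-8")
print(utf8_dotless_i, utf8_y_acute)
```

A reader of byte output has to know the charset in advance; a reader of UTF-8 output does not.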


NOTE: 10^9 is a thousand millions, a billion is a million millions (million^2), a trillion is a million million millions (million^3), and so on. Please, don't "embrace and extend" the meaning of prefixes, making communication among all people difficult. Thanks.



2. Invoking Ocrad

The format for running ocrad is:

 
ocrad [options] [files]

Ocrad supports the following options:

`--help'
`-h'

Print an informative help message describing the options and exit. `ocrad --verbose --help' also describes hidden options.

`--version'
`-V'

Print the version number of ocrad on the standard output and exit.

`--append'
`-a'

Append generated text to the output file instead of overwriting it.

`--charset=name'
`-c name'

Enable recognition of the characters belonging to the given character set. You can repeat this option with different names to process a page containing characters from several character sets.
If no charset is specified, `iso-8859-15' (latin9) is assumed.
Try `--charset=help' for a list of valid charset names.

`--crop=left,top,right,bottom'
`-p left,top,right,bottom'

Crop the input image by the rectangle defined by left, top, right and bottom. The values of left, top, right and bottom may be relative to the image size (-1.0 <= value <= +1.0), or absolute (abs( value ) > 1). Negative values are relative to the right-bottom corner of the image. Absolute and relative values can be mixed. For example `ocrad --crop 700,960,1,1' will work as expected.
The cropping is performed before any other transformation (rotation or mirroring) on the input image, and before scaling, layout analysis and recognition.
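
One possible reading of these rules, written as a Python sketch. The function names are hypothetical, and the rounding and the handling of the boundary value 1 are assumptions; ocrad itself may round or clip differently.

```python
def resolve_crop_value(value, size):
    """Turn one --crop value into a pixel coordinate along an axis of `size` pixels."""
    # Assumption: values in [-1.0, +1.0] are fractions of the image size,
    # anything larger in magnitude is an absolute pixel count.
    pixels = value * size if abs(value) <= 1 else value
    # Negative values count back from the right or bottom edge.
    if pixels < 0:
        pixels += size
    return int(round(pixels))


def resolve_crop(spec, width, height):
    """Resolve a 'left,top,right,bottom' string for a width x height image."""
    left, top, right, bottom = (float(v) for v in spec.split(","))
    return (resolve_crop_value(left, width), resolve_crop_value(top, height),
            resolve_crop_value(right, width), resolve_crop_value(bottom, height))


print(resolve_crop("700,960,1,1", 2480, 3508))
```

Under this reading, `700,960,1,1' keeps everything from pixel (700,960) to the right-bottom corner of the image, because the two values of 1 are taken as relative.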

`--filter=name'
`-e name'

Pass the output text through the given postprocessing filter.
`--filter=letters' forces every character that resembles a letter to be recognized as a letter. Other characters are output unchanged.
`--filter=letters_only' works like `--filter=letters', but other characters are discarded.
`--filter=numbers' forces every character that resembles a number to be recognized as a number. Other characters are output unchanged.
`--filter=numbers_only' works like `--filter=numbers', but other characters are discarded.
Try `--filter=help' for a list of valid filter names.

`--force'
`-f'

Force overwrite of output file.

`--format=name'
`-F name'

Select the output format. The valid names are `byte' and `utf8'.
If no output format is specified, `byte' (8-bit) is assumed.

`--invert'
`-i'

Invert image levels (white on black).

`--layout'
`-l'

Enable page layout analysis. Ocrad is able to separate blocks of text of arbitrary shape as long as they are clearly delimited by white space.

`-o file'

Place the output into file instead of into the standard output.

`--quiet'
`-q'

Quiet operation.

`--scale=value'
`-s value'

Scale the input image by value before layout analysis and recognition. If value is negative, the input image is scaled down by -value.

`--transform=name'
`-t name'

Perform the given transformation (rotation or mirroring) on the input image before scaling, layout analysis and recognition.
Try `--transform=help' for a list of valid transformation names.

`--threshold=value'
`-T value'

Set the binarization threshold for pgm or ppm files, or for the `--scale' option (only for scaled-down images). value should be a rational number between 0 and 1, and may be given as a percentage (50%), a fraction (1/2), or a decimal value (0.5). Image values greater than the threshold are converted to white. The default value is 0.5.
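
The three notations denote the same number. The Python sketch below shows one way to parse them and to apply the "greater than threshold is white" rule to grey levels; `parse_threshold' and `binarize' are hypothetical names, and normalizing each pixel by the image's maximum grey value is an assumption about how the comparison is made.

```python
from fractions import Fraction


def parse_threshold(text):
    """Parse '50%', '1/2' or '0.5' into an exact fraction between 0 and 1."""
    text = text.strip()
    if text.endswith("%"):
        value = Fraction(text[:-1]) / 100
    else:
        value = Fraction(text)          # accepts both "1/2" and "0.5"
    if not 0 <= value <= 1:
        raise ValueError("threshold must be between 0 and 1")
    return value


def binarize(pixels, maxval, threshold="0.5"):
    """Return True (white) for every pixel strictly above the threshold."""
    t = parse_threshold(threshold)
    return [Fraction(p, maxval) > t for p in pixels]


print(parse_threshold("50%"), parse_threshold("1/2"), parse_threshold("0.5"))
print(binarize([0, 100, 128, 255], 255, "50%"))
```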

`--verbose'
`-v'

Verbose mode.

`-x file'

Write (export) the OCR Results File to file. `-x -' writes to stdout, overriding text output unless output has also been redirected with the `-o' option.



3. Image Format Conversion

There are many image formats, but ocrad is able to decode only three of them: pbm, pgm and ppm. In this chapter you will find command examples and advice about how to convert image files to a format that ocrad can manage.

`.png'

Portable Network Graphics file. Use the command `pngtopnm filename.png | ocrad'.
In some cases, like the ocrad.png icon, you have to invert the image with the `-i' option: `pngtopnm filename.png | ocrad -i'.

`.ps'
`.pdf'

PostScript or Portable Document Format file. Use the command `gs -sPAPERSIZE=a4 -sDEVICE=pnmraw -r300 -dNOPAUSE -dBATCH -sOutputFile=- -q filename.ps | ocrad'.
You may also use the command `pstopnm -stdout -dpi=300 -pgm filename.ps | ocrad', but it does not seem to work with pdf files. Also, old versions of pstopnm don't recognize the `-dpi' option and produce an image too small for OCR.

`.tiff'

TIFF file. Use the command `tifftopnm filename.tiff | ocrad'.

`.jpg'

JPEG file. Use the command `djpeg -greyscale -pnm filename.jpg | ocrad'.
JPEG is a lossy format and is in general not recommended for text images.

`.pnm.gz'

Pnm file compressed with gzip. Use the command `gzip -cd filename.pnm.gz | ocrad'.

`.pnm.lz'

Pnm file compressed with lzip. Use the command `lzip -cd filename.pnm.lz | ocrad'.
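
When many files of different types have to be processed, the choice of converter can be automated. The Python sketch below only restates the commands listed in this chapter; the names `converter_command' and `ocr_file' are hypothetical, and it assumes that the converters and ocrad are installed and found in the PATH.

```python
import subprocess


def _gs(filename):
    # Ghostscript command from this chapter, rendering at 300 dpi to stdout.
    return ["gs", "-sPAPERSIZE=a4", "-sDEVICE=pnmraw", "-r300", "-dNOPAUSE",
            "-dBATCH", "-sOutputFile=-", "-q", filename]


# (suffix, command builder) pairs, taken from the entries above.
CONVERTERS = [
    (".pnm.gz", lambda f: ["gzip", "-cd", f]),
    (".pnm.lz", lambda f: ["lzip", "-cd", f]),
    (".png", lambda f: ["pngtopnm", f]),
    (".tiff", lambda f: ["tifftopnm", f]),
    (".jpg", lambda f: ["djpeg", "-greyscale", "-pnm", f]),
    (".ps", _gs),
    (".pdf", _gs),
]


def converter_command(filename):
    """Return the argument list that converts `filename` to pnm on stdout."""
    lowered = filename.lower()
    for suffix, build in CONVERTERS:
        if lowered.endswith(suffix):
            return build(filename)
    raise ValueError("no known converter for " + filename)


def ocr_file(filename, ocrad_args=()):
    """Pipe the converter output into ocrad and return the recognized bytes."""
    conv = subprocess.Popen(converter_command(filename), stdout=subprocess.PIPE)
    ocr = subprocess.run(["ocrad", *ocrad_args], stdin=conv.stdout,
                         capture_output=True, check=True)
    conv.stdout.close()
    conv.wait()
    return ocr.stdout


print(converter_command("scan.png"))
```

For example, `ocr_file("icon.png", ["-i"])' would correspond to `pngtopnm icon.png | ocrad -i'.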



4. Algorithm

Ocrad is mainly a research project. Many of the algorithms ocrad uses are ad hoc, and will change in successive releases as I myself gain understanding about OCR issues.

The overall working of ocrad may be described as follows:
1) read the image.
2) optionally, perform some transformations (crop, rotate, scale, etc).
3) optionally, perform layout detection.
4) remove frames and images.
5) detect characters and group them in lines.
6) recognize characters (very ad hoc; one algorithm per character).
7) correct some errors (transform l.OOO into 1.000, etc).
8) output result.
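
Step 7 can be illustrated with a small Python sketch. This is not ocrad's actual correction code; the look-alike table and the rule "only touch tokens that are shaped like a number with a separator" are assumptions chosen to reproduce the l.OOO example.

```python
import re

# Letters that are easily confused with digits (assumed table).
LOOKALIKES = str.maketrans({"l": "1", "I": "1", "O": "0", "o": "0"})

# A token made only of digits and look-alikes, with at least one '.' or ','.
NUMBER_LIKE = re.compile(r"[0-9lIOo]+(?:[.,][0-9lIOo]+)+")


def fix_token(token):
    """Replace look-alike letters by digits in number-shaped tokens only."""
    if NUMBER_LIKE.fullmatch(token):
        return token.translate(LOOKALIKES)
    return token


def fix_line(line):
    return " ".join(fix_token(t) for t in line.split(" "))


print(fix_line("total l.OOO units"))
```

Ordinary words such as "Oil" are left alone because they do not look like a number.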


Ocrad recognizes characters by their shape. It is fast because it does not compare the shape of every character against some sort of database of shapes and then choose the best match. Instead, ocrad compares only the shape differences that are relevant to choosing between two character categories, much like a binary search.
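
The idea can be sketched in Python as a chain of feature tests, where each test discards a whole group of candidate characters. The features (`has_hole', `has_ascender', `has_dot') and the five-character alphabet are hypothetical and far simpler than anything ocrad really uses.

```python
def classify(features):
    """Pick one of 'b', 'o', 'l', 'i', 'c' using at most three feature tests."""
    if features["has_hole"]:
        # Only characters with a closed loop remain: 'b' or 'o'.
        return "b" if features["has_ascender"] else "o"
    if features["has_ascender"]:
        return "l"
    # No hole, no ascender: the dot separates 'i' from 'c'.
    return "i" if features["has_dot"] else "c"


blob = {"has_hole": False, "has_ascender": False, "has_dot": True}
print(classify(blob))
```

No shape is ever matched against a stored template, which is what makes the approach fast, but a defect that flips a single feature sends the decision down the wrong branch.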

As there is no such thing as a free lunch, this approach has some drawbacks: it makes ocrad very sensitive to character defects, and it makes it difficult to modify ocrad to recognize new characters.



5. Reporting Bugs

There are probably bugs in ocrad. There are certainly errors and omissions in this manual. If you report them, they will get fixed. If you don't, no one will ever know about them and they will remain unfixed for all eternity, if not longer.

If you find a bug in GNU Ocrad, please send electronic mail to bug-ocrad@gnu.org. Include the version number, which you can find by running `ocrad --version'.



Concept Index

Index Entry: Section

A
algorithm: 4. Algorithm

B
bugs: 5. Reporting Bugs

G
getting help: 5. Reporting Bugs

I
image format conversion: 3. Image Format Conversion
input charsets: 1. Character Sets
invoking: 2. Invoking Ocrad

O
options: 2. Invoking Ocrad
output format: 1. Character Sets

U
usage: 2. Invoking Ocrad

V
version: 2. Invoking Ocrad

About This Document

This document was generated on May, 12 2009 using texi2html 1.76.
