From: Digest
To: "OS/2GenAu Digest"
Date: Thu, 31 Jul 2008 00:00:39 EST-10EDT,10,1,0,7200,4,1,0,7200,3600
Subject: [os2genau_digest] No. 1684
Reply-To:
X-List-Unsubscribe: www.os2site.com/list/

**************************************************
Wednesday 30 July 2008          Number 1684
**************************************************

Subjects for today

 1  Re: Tesseract : Alan Duval
 2  Re: Tesseract : Alan Duval
 3  Re: Tesseract : Alan Duval
 4  Re: Tesseract : Dennis Nolan

**= Email 1 ==========================**

Date: Wed, 30 Jul 2008 09:56:49 +1100
From: Alan Duval
Subject: Re: Tesseract

Voytek Eymont wrote:
>> I've been doing a lot of OCR work lately with Tesseract. I scan
>> documents into C:\OCR where I also have Tesseract.
>>
>
> is Tesseract the 'best OCR' there is ?
>
> being somewhat curious, I've installed T 2.01 (following this email), and,
> tried it as follows:
>
> 0[roman][F:\ute\tesseract\usr\bin]tesseract.exe F:\FAXBANKSIA\FX006309.FAX
> test
> -l eng
>
> it took about 60 seconds, and, the results (small typeface letter) were
> somewhat not so great...
>
> by contrast, OCRing same file with PMfax took half the time, 24 secs, with
> somewhat better results:
>
> PMfax OCR (24 secs)
> ----
> 24 secs
>

Hi Voytek,

I haven't used OCRing - didn't know it existed. Does it actually produce
a text file and not a scanned copy? I know that you use PMfax a lot, so
do you scan docs into PMfax and then use OCRing to convert them to text?

I want to scan articles and convert them to text formats that I can
store or send to friends. That means that they have to be converted to
*.doc files or *.pdf files. So far I have found that Tesseract does a
good job but it has trouble with the ' and - characters. I have found
its word recognition to be better than SimpleOCR which I have in WIN XP.

Tesseract works with *.tif files. I didn't think it would work with
*.fax files.

Regards,

Alan

----------------------------------------------------------------------------------

**= Email 2 ==========================**

Date: Wed, 30 Jul 2008 10:05:17 +1100
From: Alan Duval
Subject: Re: Tesseract

Voytek Eymont wrote:
>> Program field: C:\OCR\tesseract.exe
>> Parameters: image[N1].tif [N2]
>> Working directory: C:\OCR
>>
>> Now when I click on the program object a window comes up asking for the
>> value of N1 which I type in and press enter. Then the same happens for N2
>> and the image is processed. With my word processor opened I can then open
>> the file and correct any mistakes.
>>
>> Kris suggests that a REXX program could be written to simplify further
>> the process. As I don't know REXX I wonder whether someone could help?
>>
>
> so, N1 is say '123' and, N2, corresponding text file like '123.txt'
> so, do you just want to process *.tiff into likewise named txt, is this
> the general idea ?
>

Hi again,

Yes! I may scan an article and it will be saved as, say, "image123.tif"
in C:\OCR, in which folder I also have Tesseract installed. I would then
go to that folder via an OS/2 prompt and type:

"Tesseract image123.tif 123"

That then produces a file 123.txt which can be opened in a word
processor for correction of any errors and for conversion to other
formats.

Regards,

Alan

----------------------------------------------------------------------------------

**= Email 3 ==========================**

Date: Wed, 30 Jul 2008 10:29:13 +1100
From: Alan Duval
Subject: Re: Tesseract

Voytek Eymont wrote:
>>> Program field: C:\OCR\tesseract.exe
>>> Parameters: image[N1].tif [N2]
>>> Working directory: C:\OCR
>>>
>>> Now when I click on the program object a window comes up asking for the
>>> value of N1 which I type in and press enter. Then the same happens for
>>> N2
>>> and the image is processed. With my word processor opened I can then
>>> open the file and correct any mistakes.
>>>
>>> Kris suggests that a REXX program could be written to simplify further
>>> the process. As I don't know REXX I wonder whether someone could help?
>>>
>> so, N1 is say '123' and, N2, corresponding text file like '123.txt' so, do
>> you just want to process *.tiff into likewise named txt, is this the
>> general idea ?
>>
>
> actually, wouldn't 'runfor' do it for you ?
>
> runfor Ver 1.9 - Run a command - Mar 31 1998, W. Kim
>

Hi again Voytek,

It probably would, but whatever I do it still seems that I would have to
enter values. I would like to just double click on a saved
"image***.tif" file and have it open as text in a word processor, much
like one can click on an attachment to an Email and have it opened in a
word processor.

The best solution I have so far is that above. So if I scan an article
and it is saved as "image123.tif", I then click on my program object and
a window comes up requesting the value for N1. I would type "123" and
then a second window would come up requesting the value for N2. I would
again type "123" and the resulting "123.txt" file would be placed in
"C:\OCR". With a word processor I can then open the file and process it
further.

One can make a program or command that will work for a specified *.tif
file, but it has to work for any *.tif file as I may be scanning many
pages.

Regards,

Alan

----------------------------------------------------------------------------------

**= Email 4 ==========================**

Date: Wed, 30 Jul 2008 16:28:45 +1000
From: Dennis Nolan
Subject: Re: Tesseract

Alan

It could be done by associating tif files to your OCR program.
Unfortunately this will associate all tif files with the OCR program.

A better way is to create a Program object on your desktop. There is a
Program Object template in the Templates folder. Make your OCR program
the Object. From memory you just need to drag it to the Object when
creating it. During the creation you need to specify the dropped file as
the input parameter.
The Help file that you can access in the program object explains how to
do this. If it is set up correctly you only need to drag and drop your
tif files on the object for it to do its stuff.

There is a way to get it to open your word processor too, but it's been
too long for me to clearly remember how I used to do it.

When set up you can select multiple files and drop them on the program
object. A window for each dropped file will be created and closed when
it is finished. You can also specify which directory to write the output
file to.

Regards

Dennis.

----------------------------------------------------------------------------------
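
Following up on Kris's REXX suggestion and Dennis's drag-and-drop idea, here is a rough sketch of a small wrapper that removes the need to type N1 and N2. It is written in Python rather than REXX purely for illustration, and it assumes a Python interpreter is installed and that tesseract.exe can be found on the PATH. The names tesseract_command and main are made up for this example. All it does is build and run the same command Alan types by hand (tesseract image123.tif 123), taking the output name from the file name.

```python
import re
import subprocess
import sys
from pathlib import PureWindowsPath


def tesseract_command(tif_path, exe="tesseract"):
    """Build the command line used in the thread: tesseract <image> <output base>.

    Hypothetical helper. Tesseract adds ".txt" to the output base itself,
    as in Alan's example where "123" becomes 123.txt.
    """
    path = PureWindowsPath(tif_path)
    # Follow the thread's naming: image123.tif -> 123.
    # Any other file name simply keeps its stem: scan.tif -> scan.
    match = re.fullmatch(r"image(\d+)", path.stem, flags=re.IGNORECASE)
    base = match.group(1) if match else path.stem
    # Write the text file next to the image it came from.
    out_base = str(path.with_name(base))
    return [exe, str(path), out_base]


def main(tif_paths):
    """Run Tesseract once for every file name passed in."""
    for tif in tif_paths:
        cmd = tesseract_command(tif)
        subprocess.run(cmd, check=True)  # stops at the first failed page
        print("wrote " + cmd[2] + ".txt")


if __name__ == "__main__":
    main(sys.argv[1:])
```

If a Program object is set up to pass the dropped file names to a script like this, several scanned pages could be handled in one go without any prompts. Opening the resulting text file in a word processor is not covered here.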