From: Digest
To: "OS/2GenAu Digest"
Date: Sun, 3 Aug 2008 00:00:40 EST-10EDT,10,1,0,7200,4,1,0,7200,3600
Subject: [os2genau_digest] No. 1686
Reply-To:
X-List-Unsubscribe: www.os2site.com/list/

**************************************************
Saturday 02 August 2008
Number 1686
**************************************************

Subjects for today

1 Re: Tesseract : Voytek Eymont
2 Re: Tesseract : Voytek Eymont
3 Re: Tesseract : Alan Duval
4 Re: Tesseract : Alan Duval

**= Email 1 ==========================**

Date: Fri, 1 Aug 2008 23:57:32 +1000 (EST)
From: "Voytek Eymont"
Subject: Re: Tesseract

> I haven't tried this (I don't even have Tesseract), but I think what you
> need to do is write a simple Rexx script that does three things in
> sequence:
> - parse its program argument to decompose the file name, so as to
>   construct the arguments for the next two steps
> - call Tesseract
> - call the word processor

that's what I'd do

I just made a quick hack of a script I've used for something else (like
archiving log files):

it looks through one or more predefined directories and OCRs any
predefined file types (like TIF and FAX)

---------
0[roman][F:\ute]ocr

0[roman][F:\ute]SET TESSDATA_PREFIX=F:/ute/tesseract/usr/share/
ocr processing ocr, logging to \logs\ocr.log
... processing directory \scanner
... ... processing for extension tif
... ... processing for extension fax

0[roman][F:\ute\tesseract\usr\bin]tesseract F:\scanner\FX004164.FAX F:\scanner\FX004164 -l eng
Tesseract Open Source OCR Engine

0[roman][F:\ute\tesseract\usr\bin]tesseract F:\scanner\FX004167.FAX F:\scanner\FX004167 -l eng
Tesseract Open Source OCR Engine
... processing directory \scanner\out
... ... processing for extension tif
... ... processing for extension fax

----------
ymmv

----
/* ocr.cmd */

/* you MUST have following:
tesseract ocr application
'
this does very little (none ?) in the way of error checking,
if your application hold logs open, this will skip and log any open logs
if you don't know what it all means, stop now

tesseract F:\FAXBANKSIA\FX006309.FAX test -l eng

*/


/* user defines below */

extlist= 'tif fax' /* list all target extensions to process */
dirlist= '\scanner \scanner\out' /* list all target dirs to process */
logdir= '\logs\' /* target dir for logs, NEEDS trailing '\' */

arch= 'tesseract image text -l eng' /* command to execute against targets */
arch= 'tesseract'
/* arch= 'pause' */ /* UNCOMMENT this line for testing ? perhaps .. */
ocrdir= '\ute\tesseract\usr\bin' /* where is t exe ? */
'SET TESSDATA_PREFIX=F:/ute/tesseract/usr/share/' /* t's libs are there */

/* user defines end */


IF RxFuncQuery('SysLoadFuncs') THEN /* assume libraries loaded ...*/
DO
   CALL RxFuncAdd 'SysLoadFuncs', 'RexxUtil', 'SysLoadFuncs'
   CALL SysLoadFuncs
   SAY '... loading REXX Utilities libraries ...'
END
call time(e) /* let's time it */
call logfile
say thiscmd 'processing ocr, logging to' log

curdir = directory() /* get where we are */
wdirlist=dirlist /* load rubber bullets */
wextlist=extlist

DO WHILE wdirlist >'' /* loop for all target paths */
   PARSE VAR wdirlist target wdirlist
   say '... processing directory 'target

   DO WHILE wextlist >'' /* loop for all extensions */
      PARSE VAR wextlist ext wextlist
      say '... ... processing for extension 'ext

      call SysFileTree target||'\*.'||ext, 'file', 'FO' /* find out all target LOGs */

      do i=1 to file.0 /* we assume there are several valid targets... */
         /* make sure NOT in use */
         if (stream(file.i,'C','OPEN WRITE') = 'READY:') then
         DO
            call stream file.i, 'C', close
            /* we need to strip FQPFname to a simple file name */
            PARSE value file.i WITH name '.' dump
            CALL directory(ocrdir)
            'tesseract' file.i name '-l eng'
         END
         else call LINEOUT log, date() time() 'skipping 'file.i', file in use'
      end

   end /* do while wextlist */
   wextlist=extlist /* reload */
end /* do while wdirlist */

call LINEOUT log, date() time() thiscmd' using 'arch' in 'dirlist' for extensions 'extlist' in 'TRUNC(time(e)) 'sec.'
call LINEOUT log /* let's log we done it */
CALL directory(curdir)

EXIT


logfile:
parse source . . thiscmd /* lets see what are we running, and set log & par */
at_char=LASTPOS('\' , thiscmd)
thiscmd=SUBSTR(thiscmd , at_char + 1 )
parse value thiscmd with thiscmd'.'ext
log = logdir|| thiscmd || '.log'
return

/*
fully guaranteed never to overwrite any CD ROM
*/

----

--
Voytek

----------------------------------------------------------------------------------

**= Email 2 ==========================**

Date: Sat, 2 Aug 2008 00:02:44 +1000 (EST)
From: "Voytek Eymont"
Subject: Re: Tesseract

>
> I haven't used OCRing - didn't know it existed. Does it actually produce
> a text file and not a scanned copy? I know that you use PMfax a lot, so do
> you scan docs into PMfax and then use OCRing to convert them to text? I
> want to scan articles and convert them to text formats that I can store or
> send to friends. That means that they have to be converted to *.doc files
> or *.pdf files. So far I have found that Tesseract does a good job but it

yes, I scan with CopyShop into PMfax; generally, I send out the PMfax
TIFF-F files, and occasionally make the TIFF into a PDF

I occasionally OCR scanned stuff to text, not very often

no, first you need a scanned copy, then you OCR part or all of it; I
generally OCR to the clipboard (the default in PMfax), then paste into whatever

--
Voytek

----------------------------------------------------------------------------------

**= Email 3 ==========================**

Date: Sat, 02 Aug 2008 21:03:33 +1100
From: Alan Duval
Subject: Re: Tesseract

Peter Moylan wrote:
> Alan Duval wrote:
>> Dennis Nolan wrote:
>
>>> A better way is to create a Program object on your desktop. There is
>>> a Program Object template in the Templates folder.
>>> Make your OCR program the Object. From memory you just need to drag
>>> it to the Object when creating it.
>>> During the creation you need to specify the dropped file as the
>>> input parameter. The Help file that you can access in the program
>>> object explains how to do this.
>>>
>>> If it is set up correctly you only need to drag and drop your tif
>>> files on the object for it to do its stuff.
>>
>> I can drag and drop my tif files on the program object that I created
>> and it will process it and save it to C:\OCR.
>>
>>> There is a way to get it to open your word processor too, but it's
>>> been too long for me to clearly remember how I used to do it.
>>
>> That's what I now want but can't see how to do it. I can drag and
>> drop the txt file that has been created on to the word processor
>> object and it opens in the word processor but I would like that to
>> happen without doing this second drag and drop.
>
> I haven't tried this (I don't even have Tesseract), but I think what
> you need to do is write a simple Rexx script that does three things in
> sequence:
> - parse its program argument to decompose the file name, so as to
>   construct the arguments for the next two steps
> - call Tesseract
> - call the word processor
>
> For someone who is not familiar with Rexx (I don't know whether you
> are), the only hard part is the parsing of the file name, and even
> that is easy once you look up the Rexx manual because Rexx has an
> explicit PARSE command. The rest is just like writing a batch file.
>
> Suppose this script is called "script.cmd". Then you can create a
> program object which has the program name specified as "CMD.EXE"
> (without the quotes), and the parameter string "/C SCRIPT.CMD" (also
> without the quotes). The working directory should be the directory
> where script.cmd lives. Alternatively, you can give a full path
> specification for script.cmd, and set the working directory to be
> where you want your data files to live. That part is not particularly
> important, because you can always include CD (i.e. change directory)
> commands in your script, or use full path names for every file that
> has to be mentioned.
>
> On further thought, it's possible that the parameter string in the
> program object should be something like "/C SCRIPT.CMD %1", or
> something similar, to ensure that the parameter is passed to the
> script. I can't check that now because I don't have OS/2 at work.
>

Thanks Peter,

I don't know much about REXX. All I've done is to write the HELLO
command in REXX. I'm lost when I read the PARSE and CALL commands. I'd
have to get a book and work steadily through it to know what I was doing.

Regards,
Alan

----------------------------------------------------------------------------------

**= Email 4 ==========================**

Date: Sat, 02 Aug 2008 21:07:04 +1100
From: Alan Duval
Subject: Re: Tesseract

Voytek Eymont wrote:
>
>> I haven't tried this (I don't even have Tesseract), but I think what you
>> need to do is write a simple Rexx script that does three things in
>> sequence:
>> - parse its program argument to decompose the file name, so as to
>>   construct the arguments for the next two steps
>> - call Tesseract
>> - call the word processor
>>
>
> that's what I'd do
>
> I just made a quick hack of a script I've used for something (like
> archiving log files):
>
> it looks through one or more of predefined directories, and, ocrs any
> predefined file types (like TIF and FAX)
>
> ---------
> 0[roman][F:\ute]ocr
>
> 0[roman][F:\ute]SET TESSDATA_PREFIX=F:/ute/tesseract/usr/share/
> ocr processing ocr, logging to \logs\ocr.log
> ... processing directory \scanner
> ... ... processing for extension tif
> ... ... processing for extension fax
>
> 0[roman][F:\ute\tesseract\usr\bin]tesseract F:\scanner\FX004164.FAX F:\scanner\FX004164 -l eng
> Tesseract Open Source OCR Engine
>
> 0[roman][F:\ute\tesseract\usr\bin]tesseract F:\scanner\FX004167.FAX F:\scanner\FX004167 -l eng
> Tesseract Open Source OCR Engine
> ... processing directory \scanner\out
> ... ... processing for extension tif
> ... ... processing for extension fax
>
> ----------
> ymmv
>
> ----
> /* ocr.cmd */
>
> /* you MUST have following:
> tesseract ocr application
> '
> this does very little (none ?) in the way of error checking,
> if your application hold logs open, this will skip and log any open logs
> if you don't know what it all means, stop now
>
> tesseract F:\FAXBANKSIA\FX006309.FAX test -l eng
>
> */
>
>
> /* user defines below */
>
> extlist= 'tif fax' /* list all target extensions to process */
> dirlist= '\scanner \scanner\out' /* list all target dirs to process */
> logdir= '\logs\' /* target dir for logs, NEEDS trailing '\' */
>
> arch= 'tesseract image text -l eng' /* command to execute against targets */
> arch= 'tesseract'
> /* arch= 'pause' */ /* UNCOMMENT this line for testing ? perhaps .. */
> ocrdir= '\ute\tesseract\usr\bin' /* where is t exe ? */
> 'SET TESSDATA_PREFIX=F:/ute/tesseract/usr/share/' /* t's libs are there */
>
> /* user defines end */
>
>
> IF RxFuncQuery('SysLoadFuncs') THEN /* assume libraries loaded ...*/
> DO
>    CALL RxFuncAdd 'SysLoadFuncs', 'RexxUtil', 'SysLoadFuncs'
>    CALL SysLoadFuncs
>    SAY '... loading REXX Utilities libraries ...'
> END
> call time(e) /* let's time it */
> call logfile
> say thiscmd 'processing ocr, logging to' log
>
> curdir = directory() /* get where we are */
> wdirlist=dirlist /* load rubber bullets */
> wextlist=extlist
>
> DO WHILE wdirlist >'' /* loop for all target paths */
>    PARSE VAR wdirlist target wdirlist
>    say '... processing directory 'target
>
>    DO WHILE wextlist >'' /* loop for all extensions */
>       PARSE VAR wextlist ext wextlist
>       say '... ... processing for extension 'ext
>
>       call SysFileTree target||'\*.'||ext, 'file', 'FO' /* find out all target LOGs */
>
>       do i=1 to file.0 /* we assume there are several valid targets... */
>          /* make sure NOT in use */
>          if (stream(file.i,'C','OPEN WRITE') = 'READY:') then
>          DO
>             call stream file.i, 'C', close
>             /* we need to strip FQPFname to a simple file name */
>             PARSE value file.i WITH name '.' dump
>             CALL directory(ocrdir)
>             'tesseract' file.i name '-l eng'
>          END
>          else call LINEOUT log, date() time() 'skipping 'file.i', file in use'
>       end
>
>    end /* do while wextlist */
>    wextlist=extlist /* reload */
> end /* do while wdirlist */
>
> call LINEOUT log, date() time() thiscmd' using 'arch' in 'dirlist' for extensions 'extlist' in 'TRUNC(time(e)) 'sec.'
> call LINEOUT log /* let's log we done it */
> CALL directory(curdir)
>
> EXIT
>
>
> logfile:
> parse source . . thiscmd /* lets see what are we running, and set log & par */
> at_char=LASTPOS('\' , thiscmd)
> thiscmd=SUBSTR(thiscmd , at_char + 1 )
> parse value thiscmd with thiscmd'.'ext
> log = logdir|| thiscmd || '.log'
> return
>
> /*
> fully guaranteed never to overwrite any CD ROM
> */
>
> ----
>

Thanks Voytek,

However I'm afraid it's too complicated for me as I am not familiar
with REXX.

Regards,
Alan

----------------------------------------------------------------------------------
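
Editor's note: the three steps Peter Moylan outlines in Email 3 (parse the
argument, call Tesseract, call the word processor) might be sketched
roughly as below. This is only an untested illustration, not anything
posted to the list. It assumes that tesseract.exe is on the PATH, that
TESSDATA_PREFIX is already set as in Voytek's script, and that Tesseract
writes its output to the given base name plus ".txt". The script name
ocrview.cmd is hypothetical, and E.EXE (the OS/2 System Editor) merely
stands in for whatever word processor is actually wanted.

```rexx
/* ocrview.cmd - hypothetical name; sketch of the three steps only */

/* Step 1: take the file name passed in and work out the output name */
PARSE ARG image
image = STRIP(STRIP(image), 'B', '"')   /* drop blanks and surrounding quotes */
IF image = '' THEN
DO
   SAY 'usage: ocrview imagefile'
   EXIT 1
END

/* keep everything before the last dot, but only if that dot */
/* comes after the last backslash (i.e. it is in the file name) */
dot = LASTPOS('.', image)
IF dot > LASTPOS('\', image) THEN outbase = LEFT(image, dot - 1)
ELSE outbase = image

/* Step 2: call Tesseract (assumed to be on the PATH) */
'tesseract "'||image||'" "'||outbase||'" -l eng'
IF rc <> 0 THEN
DO
   SAY 'tesseract returned' rc
   EXIT rc
END

/* Step 3: open the text file; E.EXE is a stand-in for the word processor */
'e "'||outbase||'.txt"'
EXIT 0
```

Unlike Voytek's ocr.cmd, which sweeps whole directories, this handles a
single file, which is closer to the drag-and-drop use Alan describes. It
could serve as the SCRIPT.CMD in Peter's program object, provided the
program object really does pass the dropped file name through as the
script's argument, which Peter himself was unsure of.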