05 March 2001. GoSWISH ver 1.6
Daniel Hellerstein, danielh at crosslink dot net
GoSWISH: A Search Engine Utility For OS/2
Abstract: GoSWISH is a free search engine for OS/2 Web Sites.
GoSWISH consists of ver 1.3 of the SWISH "web indexer",
a script to automate its use, and
a script to provide a front-end to WWW clients.
--------------------------
Table of Contents:
I. Introduction
II. Installation
IIa. Installing as a CGI-BIN script
IIb. Installing as an SRE-http addon
IIc. Notes for upgraders
III. Using GoSWISH
IIIa. Using GOSWISH.HTM
IV. The GOSWISH.CMD program
IV.a. Invoking GOSWISH.CMD
IV.b. GOSWISH.CMD: Make a SWISH index file
IV.c. GOSWISH.CMD: Search a SWISH index file
V. The MKDCT.CMD program
AppA. Hints on using SWISH
AppB. Acknowledgements and Legal Stuff
--------------------------
I. Introduction
The Simple Web Indexing System for Humans (SWISH) is a fast (and free)
multi-platform program for generating & searching indices of the contents
of a set of files. SWISH is designed to be used as a "keyword search" tool,
and will return output that can be incorporated into WWW output.
SWISH does the hard part of indexing (extracting words from selected
documents and organizing them in compact fashion), and of searching
this index for specified keywords. Although not terribly difficult
to use, it is a bit idiosyncratic. By providing a simple, web-based
interface to SWISH, GoSWISH can help the typical (i.e., lazy and/or
overworked) web administrator use SWISH to keep his search tools
up-to-date and easy-to-use. In addition, GoSWISH makes it easy to
create, and display, summaries of the files that are found during
a keyword search.
GoSWISH includes a copy of SWISH ver 1.3. This is GNU-freeware (see the
appendix for further details).
To use GoSWISH you should have an OS/2 web server that understands CGI-BIN.
Especially useful is a web-server that can handle POSTed CGI-BIN requests
(since the set of options can get rather long).
Even better, GoSWISH can also be run (somewhat more efficiently) as an
addon for the SRE-http freeware web server for OS/2 (see
http://www.srehttp.org for details).
--------------------------------
II. Installation
First, you should UNZIP GOSWISH to an empty temporary directory.
Most people will want to use the INSTALL.CMD program (that
comes with GoSWISH) to install. This program will ask you whether
you are installing GoSWISH to be run as a CGI-BIN program, or
as an addon for the SRE-http web server. It will then ask you
for a few directories, modify a few files, and then copy them.
After it's done, you can try using GoSWISH!
For those who like to install software "by hand", the following
instructions can be followed.
IIa: Installing GoSwish as a CGI-BIN Script
In addition to copying files to appropriate places, the installation
of GoSWISH requires a few changes to the GOSWISH.CMD file; and possibly
to the GOSWISH.HTM file. Therefore, pay especial attention to step iv of
the following instructions.
0) Unzip GOSWISH.ZIP to an empty temporary directory
i) Create a SWISH subdirectory in some convenient location. For sake
of explanation, let's assume that you create D:\HTTP\SWISH.
ii) Copy the following files to this directory (say, to D:\HTTP\SWISH):
SWISH-E.EXE
GOSWISH.CMD
iii) Copy GOSWISH.HTM to somewhere in your web tree. That is,
copy it to some place accessible via the www. For sake of
explanation, let's assume you copy it to E:\WWW, and E:\WWW is
the "root" of your web tree.
Then, you must edit GOSWISH.HTM (with your favorite text editor) and
make the following changes:
a) Change all occurrences of the string "search_document_directory"
(without the quotes) to the relative directory you'd like
your "search documents" written to by default
(or change it to an empty string).
* For example, change it to "SWISH/" (without the quotes).
b) Change all occurrences of the string "GOSWISH.CMD"
(in URLs and in forms) to /CGI-BIN/GOSWISH.CMD (or whatever
string your server uses to signify CGI-BIN scripts).
iv) With your favorite text editor, edit GOSWISH.CMD (the copy in
your SWISH directory) and change the following parameters (they'll be
very clearly marked):
a) SWISH_DIR -- the directory used to store "SWISH" indices.
b) WEB_ROOT_DIR -- The root of your web tree.
For example:
SWISH_DIR='D:\HTTP\SWISH'
WEB_ROOT_DIR='E:\WWW'
c) If you are planning on using the "regenerate swish index"
option, and your web server uses something other than
/cgi-bin/ as a "cgi-bin script" signal, then you should also
change the CGI_STRING variable.
v) Copy this "modified" version of GOSWISH.CMD to your CGI-BIN
directory.
Assuming that there are no other access restrictions (is there
an HTACCESS file you need to modify?) you are now ready to use GoSWISH!
--------------------------------
II.b. Installing GoSwish as an SRE-http addon
SRE-http users can use the above instructions and run GoSWISH as a CGI-BIN
script. However, for purposes of speed and flexibility, we recommend running
GoSWISH as an SRE-http addon.
In addition to the above steps:
vi) As part of step iv, you might want to modify the NEED_PRIVS
parameter in GOSWISH.cmd.
NEED_PRIVS is used to limit who can create new SWISH indices.
vii) Instead of (or in addition to) copying to the CGI-BIN directory, copy
GOSWISH.CMD to your SRE-http "addon" directory. If you are using a
version of SRE-http newer than 1.2L.1297d, you can delete the version of
GOSWISH.CMD on "D:\HTTP\SWISH". Otherwise, you'll have to keep the
two copies synchronized (later versions of SRE-http keep track of
where a script is running from).
viii) Optional. Copy MKDCT.CMD to your SWISH (e.g., D:\HTTP\SWISH) directory.
In addition, you may be interested in the following "sample" files; you
should copy them to your SWISH directory.
SAMPLE.CON : a sample SWISH "configuration" file
SAMPLE.SWI : a sample SWISH "index" file
SRCHSAMP.HTM : a sample "use GoSWISH to search an index" document
SAMPLE.DCT : a sample "description cache file"
MKDCT.IN : a sample "list-of-URLS" (used by MKDCT.CMD).
DESCRIBE.TXT : a sample "directory-specific" description file
(used by MKDCT).
IIc. Notes for upgraders:
GoSWISH 1.4 incorporates a number of changes, many of which are "under
the hood". The most important change is support for SWISH 1.3;
which is shipped as SWISH-E.EXE (a copy of SWISH-E.EXE is included
in GOSWISH.ZIP).
SWISH 1.3 is less buggy than SWISH 1.1, and supports some useful
new features. Unfortunately, the "index" files created by SWISH 1.3
are different (just a bit, but enough) from those created by SWISH 1.1.
Therefore, if you have older SWISH index files (say, as created by
GoSWISH 1.2 or before), you may want to NOT use SWISH 1.3. GoSWISH
will work with SWISH 1.1 and SWISH 1.2 (albeit with these
new features disabled), but you must set a few parameters.
In particular, the SWISH_VERSION parameter in GOSWISH.CMD must be set to:
SWISH_VERSION=1.1
(see the GOSWISH.CMD file for details).
As of version 1.44, GoSWISH is now distributed with rxSWISH.DLL -- a
dynamic link library that emulates SWISH-E.EXE (ver 1.3). We recommend
using this -- just set the SWISH_VERSION parameter to:
SWISH_VERSION='13_DLL'
Hint: you should move (or copy) rxSWISH.DLL (from your server's root
directory) to a directory in your LIBPATH (such as x:\os2\dll).
Lastly -- see the READ.ME for late breaking news.
Minor notes:
* SRCHINDX.CMD is no longer supported -- for all intents and purposes,
all its functionality is now incorporated into GOSWISH.CMD
* The WWWDIR option is no longer supported -- you can't change the
"WEB_ROOT_DIRECTORY" on the fly. But, since you can specify fully
qualified directories, and multiple replacement rules, this capability is
no longer needed.
* Note the use of SWISH_DIR and WEB_ROOT_DIR instead of INDEX_DIR and WWW_DIR
* The HEADER option has been modified (it no longer adds
* H1 and H2 options, similar to HEADER, now available
* GoSWISH can now tell SWISH-E to use "stemming" rules when indexing
* "Property" retention, with later display, is supported in GoSWISH.
* FOOTER_FILEs and HEADER_FILEs are now assumed to be in the
SWISH_DIR directory.
--------------------------------
III. Using GoSWISH
To get started with GoSWISH, we recommend GOSWISH.HTM: it provides a complete
interface to GoSWISH and to SWISH. With GOSWISH.HTM you can:
a) Specify the directory (fully qualified, or web relative) to create an
index of, and then create the index.
b) Create "summaries" of all text (plain and html) documents encountered
whilst creating the index.
c) Create an HTML document that allows you to enter keywords, and then search
the index you just created.
The HTML documents created in step c can then be made "available to the
public". Alternatively, GoSWISH automatically tracks the various indices
created (say, one for each of several major areas), and you can instruct
GoSWISH to offer a menu listing these various indices.
More ambitious users can structure their own "calls" to GOSWISH.CMD. The
various options understood by GOSWISH.CMD are listed in Section IV.
You may have noticed that we've mentioned "descriptions" and "summaries" a few
times. This refers to the generation of short (300 word) summaries for
matched files. Thus, not only will matching files be found, but descriptions
(extracted from the contents of the file) can also be displayed.
An important point should be remembered: the SWISH "index of your site" is a
static document -- it will NOT reflect recent changes in the contents of your
website. Frequent recreations of this index may be necessary (hopefully,
GoSWISH will make that easy to do).
--------------------------------
IIIa. GOSWISH.HTM
GOSWISH.HTM automates practically everything having to do with index creation
and use -- you may never need to use any other tool. However, please do not
hesitate to modify, change, or otherwise chop GOSWISH.HTM into its constituent
pieces.
GOSWISH.HTM can be used for three purposes:
a) to create a Swish Index (and a GoSWISH description-cache file)
b) to search a Swish Index (and a GoSWISH description-cache file)
c) to list currently available "search forms" (created by GoSWISH)
GOSWISH.HTM has two index creation modes: a "simple" and a "custom" mode.
In most cases the simple mode will be quite adequate. The custom mode is
actually quite similar, with defaults that are the same as those used
when the "simple" mode is used. Do note that the "custom" mode uses
a POST style request; so your server must understand POST requests.
Regardless of which creation mode you use, GoSWISH will activate GOSWISH.CMD.
GOSWISH.CMD will then
a) create a "SWISH configuration" file,
b) launch SWISH in a new process on your server, and feed it this
configuration file
c) if desired, it will also generate descriptive summaries
(and store them in a description-cache file).
d) A short status document will be returned to you (the client), which will
contain a link to a sample "search this index" HTML document. This
document contains a form that allows you to specify keywords, and a
few simple options (such as number of matches). In addition, if you
chose to create summaries, you can also specify whether or not to
display summaries.
This document can be used as is. Or, you can append, amend or
otherwise modify it. If you want to use it "as is", you might need to
wait until the index (and descriptions) creation (of steps b and c) are
completed; a task which may take a few minutes (depending on the number
and size of the files to be indexed).
Notes:
* GOSWISH will launch daughter processes; so receipt of a complete
response from the server does not mean all the work's been done.
To avoid this uncertainty, SRE-http users can "monitor swish while
it runs", as well as monitor the generation of descriptions.
Unfortunately, due to the simplicity of the CGI-BIN protocol, when
run as a CGI-BIN script this "monitoring" feature is not available.
* GoSWISH will generate a sample "search this index" document that
uses the "search mode" of GoSWISH. You can use this search document
as is (it's a reasonably efficient interface), or you can modify it
with your favorite text editor. If you want to take a shot at
this sort of customization, you should read section IV.c.
--------------------------------
IV. GOSWISH.CMD
The following describes the various GOSWISH.CMD options. Note that many of
these refer to SWISH options -- see Appendix A for an overview of what the
SWISH options do. In addition, the "custom" section of GOSWISH.HTM describes
most of these options.
IV.a. Invoking GOSWISH.CMD
GOSWISH.CMD can be invoked with a URL of the form
/cgi-bin/GOSWISH?mode=x&option1=val1&option2=val2&etc.
or, if you are an SRE-http user:
/GOSWISH?mode=x&option1=val1&option2=val2&etc.
The MODE argument determines what type of action GOSWISH.CMD will
perform. MODE can take one of the following values:
MODE=M : Create a swish index
MODE=S : Search a SWISH index
MODE=L : List currently available SWISH indices (and provide
links to forms that can be used to search them)
MODE=REGEN : List currently available SWISH configuration files,
which can be used to regenerate a SWISH index
This is provided as an alternative to running
SWISH-E (with a pre-existing configuration file)
from an OS/2 command prompt. Note that "descriptive
summaries" are NOT regenerated, only the SWISH index.
MODE=2REGEN : Regenerate a swish index (using a given swish
configuration file). MODE=2REGEN is generated
by MODE=REGEN -- it will not be described in
this document.
Example: MODE=M
The remaining options understood by GOSWISH depend on the value of MODE:
one set of options is used when MODE=M, and a second set
is used when MODE='S'. Note that when MODE='L' or MODE='REGEN',
no other options are used, and when MODE='2REGEN', a set of
filename options are used.
The following sections describe the GOSWISH.CMD options. Note that
some of these examples presume you are using a URL to invoke GOSWISH, hence
URL encoding rules (such as using a + for a space) are displayed. However,
if you are using a FORM with INPUT elements, you should enter the unencoded
characters (for example, a space rather than a +).
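As an illustration of the URL form described above, the following Python
sketch assembles a GOSWISH request and applies URL encoding to the option
values. It is only a sketch: the goswish_url helper and its arguments are
hypothetical names chosen for this example, and the /cgi-bin/GOSWISH prefix
should be replaced with whatever your server uses (SRE-http users would
pass script="/GOSWISH").

```python
from urllib.parse import urlencode

def goswish_url(mode, options=None, script="/cgi-bin/GOSWISH"):
    """Build a GOSWISH request URL (hypothetical helper, for illustration)."""
    pairs = [("mode", mode)]
    # options is a list of (name, value) pairs, so a name may be repeated
    pairs.extend(options or [])
    # urlencode turns spaces into + ; safe="/*" leaves / and * readable
    return script + "?" + urlencode(pairs, safe="/*")

# A space delimited SEL list becomes a + delimited list in the URL
url = goswish_url("M", [("sel", "/DIR1 /DIR10 /DIR2")])
print(url)
```

The same helper can be used for the other MODE values; only the list of
(name, value) pairs changes.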
But before listing these options, please note that GOSWISH.CMD contains a
number of configuration parameters (some of which can be overridden by
the following options). You can change these parameters by editing
GOSWISH.CMD with your favorite text editor -- the parameters are
in a section at the top of the file, and are documented.
--------------------------------
IV.b. GOSWISH.CMD: Creating a SWISH Index (mode=M)
In "create a swish index" mode, there are four classes of options:
file options, indexing rules, description options, and other options.
** File Options: You MUST set the SEL option (defaults are used
for the others).
SEL : The directories to be searched.
You can enter..
relative directories (directories that do not have a drive letter), or
fully qualified directories.
in a space delimited list.
Relative directories are assumed to be under (subdirectories of) the
WEB_ROOT_DIR directory.
Example: SEL=/ (search WEB_ROOT_DIR and all its subdirectories)
SEL=DIR1/
SEL=/DIR1 /DIR10 /DIR2
SEL=/DIR1/ /DIR2/* /DIR3/
Notes:
* swish will index the subdirectories of each directory
(relative or absolute) that you enter.
* To specify an explicit set of files in a directory, but not in
subdirectories (of a directory), you can use * as a wildcard.
For example:
samples/ means all files in samples/, and in subdirectories of samples
samples/* means all files in samples/
but NOT in subdirectories of samples/
samples/foo*.* means all files that match foo*.* in samples/
but NOT in subdirectories of samples/
* trailing and leading / (or \) characters in relative directory
names are strictly optional (they will be removed and added
as need be).
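The difference between the "directory" and "wildcard" forms of a SEL entry
can be sketched in a few lines of Python. This is not how GoSWISH or SWISH
implement the selection; sel_matches is a hypothetical function that only
mirrors the rules listed in the notes above (relative names, / or \
separators, case ignored as an assumption).

```python
import fnmatch
import posixpath

def sel_matches(sel_entry, path):
    """Return True if the relative file `path` is selected by `sel_entry`."""
    # Leading and trailing / or \ characters are optional, so drop them
    entry = sel_entry.replace("\\", "/").strip("/").lower()
    path = path.replace("\\", "/").strip("/").lower()
    directory, filename = posixpath.split(path)
    if "*" in entry:
        # Wildcard form: files in that directory only, NOT in subdirectories
        entry_dir, pattern = posixpath.split(entry)
        return directory == entry_dir and fnmatch.fnmatch(filename, pattern)
    # Directory form: the directory itself and all of its subdirectories
    return entry == "" or directory == entry or directory.startswith(entry + "/")

print(sel_matches("samples/", "samples/sub/a.htm"))   # directory form
print(sel_matches("samples/*", "samples/sub/a.htm"))  # wildcard form
```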
SWIFILE: The name of the "swish index" to create. If not a fully qualified
name, it will be written to the SWISH_DIR directory. If not
specified, a unique random name (in SWISH_DIR) will be used.
SWISHVERSION: Which version of SWISH to use. This overrides the default
SWISH_VERSION variable set in GOSWISH.CMD.
SWISHVERSION can take one of the following values:
11 == version 1.1
12 == version 1.2
13 == version 1.3
Example: SWISHVERSION=13
SEARCHDOC: The name of the "search this index" HTML document. Relative
names are assumed to be relative to the WEB_ROOT_DIR. If not
specified, a unique random name (in WEB_ROOT_DIR) will be
used.
If a fully-qualified name is used, you should also include
the "selector" that will invoke this file. For example:
SEARCHDOC=D:\WWWNEW\SDOCS\SEARCHX.HTM+/WWW2/SEARCHX.HTM
... note the use of a "url-encoded space" (the +) to delimit the
fully qualified file name and the "selector").
OVERWRITE If set to 1, newly created files will overwrite pre-existing files of the same name.
Note that this overrides the OVERWRITE variable in GOSWISH.CMD
** Indexing Rules:
EXTLIST : List of extensions to index -- if a file does not have one
of these extensions, it will be ignored.
EXTLIST_NOFOLLOW: Do NOT extract "words" from these documents.
You should ONLY extract words from text documents
(such as HTML files).
DOSTEM If 1, then use the "stemming algorithm". If 0, do NOT use
the stemming algorithm. The default value is 1
Example: DOSTEM=1
METANAMES List of name fields to assign words to.
If you specify a list of such fields (such as DESCRIPTION and
CONTENT), then you'll be able to search explicitly for words
appearing in these elements in HTML files. For example,
if you specify:
METANAMES=DESCRIPTION
then you can later search using a KEYWORD
DESCRIPTION=myword
and SWISH will find all files that have the word "myword" in
a "description" element.
There is one drawback -- these words will NOT be found under a
usual search (however, you can "or" together normal keyword
searches and keyword searches of the DESCRIPTION=myword form).
PROPNAMES List of "property names" (defined as values of META tags)
to retain for each file. This is extra descriptive information
that can be shown with other search results.
Example: PROPNAMES=DESCRIPTION+AUTHOR+MODIFIEDDATA
IGNORELIMIT: Used to ignore "common" words that occur too frequently
IGNOREWORDS: A set of common words to ignore (such as "the" and "or").
If not specified, a list of about 1000 "swishdefault"
words are used.
REPWITH: SWISH will store files using fully qualified file names. If you
want to store URLS (a requirement if you want clients to be
able to "click to receive" matched documents), the REPWITH option can
be used to specify a "replacement rule". By default, a
default REPWITH is used (that will create a selector that is
relative to the value of the SEL option). Note that
if you specify multiple directories to search, by default
a separate replacement rule will be generated for each
directory you specify.
If you specify any REPWITH rules, default ReplaceRules will
NOT be generated!
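The idea behind a replacement rule can be sketched as follows. Note that
this Python fragment does NOT show the actual REPWITH syntax (which is not
given here); it only illustrates, with a hypothetical apply_replace_rules
function and made-up directories, how one rule per searched directory could
turn a fully qualified file name into a selector.

```python
def apply_replace_rules(filename, rules):
    """Map a fully qualified file name to a selector (illustration only).

    rules is a list of (file_prefix, url_prefix) pairs; the first rule
    whose file_prefix matches the start of the file name is applied.
    """
    normalized = filename.replace("\\", "/")
    for file_prefix, url_prefix in rules:
        prefix = file_prefix.replace("\\", "/").rstrip("/")
        if normalized.lower().startswith(prefix.lower() + "/"):
            # Swap the directory prefix for the URL prefix, keep the rest
            return url_prefix.rstrip("/") + normalized[len(prefix):]
    return normalized  # no rule matched: the name is left unchanged

# One rule per directory being indexed (both directories are assumptions)
rules = [("E:\\WWW\\SAMPLES", "/SAMPLES"), ("D:\\DOCS", "/docs")]
print(apply_replace_rules("E:\\WWW\\SAMPLES\\sub\\a.htm", rules))
```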
These FR_ options are used to suppress indexing of files and directories.
Please see SWISH.HTM for details.
FR_DIRECTORY: The "FILE RULES DIRECTORY" instructions
FR_TITLE: The "FILE RULES TITLE" instructions
FR_FILENAME: The "FILE RULES FILENAME" instructions
FR_PATHNAME: The "FILE RULES PATHNAME" instructions.
These index options are stored in the swish index file; they just
provide identifying information.
INDEXNAME : Name of the index
INDEXADMIN: Administrator
INDEXPOINTER: Pointer to this index
INDEXDESCRIPTION: Description of the index
** Description options.
The "description-cache" (DCT) file is created by extracting information from each
matched document. For HTML documents, a META NAME="DESCRIPTION" or
META HTTP-EQUIV="DESCRIPTION" element is used (if available).
For example:
<META NAME="DESCRIPTION" CONTENT="a short description of this document">
Otherwise, values of headers are used.
For non-HTML documents, the first few hundred characters are used.
Note to SRE-http users: The MKDCT program can also be used to create
a "description-cache file".
MAKESUMMARY: Make a descriptions-cache (DCT) file (that contains file summaries)
Can be one of :
0 = Do NOT create a DCT file. If you select
0, then descriptions will NOT be available (of course,
you can always rerun GOSWISH to make descriptions at a later
date)
Note: on sites with well-documented HTML files,
you can use DESCRIPTION (and other such)
"properties" instead of summaries.
1 = Read descriptive summaries from a DESCRIBEFILE.
2 = Read descriptive summaries from a DESCRIBEFILE.
If no such directory-specific description file exists,
and this is a "text" file, then create a descriptive summary
by examining the contents of the file.
Note: "text files" are defined as "indexed" files -- files that
do NOT match the EXTLIST_NOFOLLOW list.
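The three MAKESUMMARY settings amount to a small decision rule, sketched
below in Python. The summary_source function and its return values are
hypothetical names used only to restate the rules above; they are not part
of GOSWISH.CMD.

```python
def summary_source(makesummary, has_describe_entry, is_text_file):
    """Return where a file's summary would come from, or None if it gets none."""
    if makesummary == 0:
        return None                  # no DCT file is created at all
    if has_describe_entry:
        return "describefile"        # modes 1 and 2 both prefer the DESCRIBEFILE
    if makesummary == 2 and is_text_file:
        return "contents"            # mode 2 falls back to the file's contents
    return None

# A text file with no DESCRIBEFILE entry is only summarized when MAKESUMMARY=2
print(summary_source(1, False, True), summary_source(2, False, True))
```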
DCTFILE: The name to use for the "descriptions-cache" (DCT) file. Relative names
are written relative to the SWISH_DIR. If not specified,
a random name (in SWISH_DIR) is used.
DESCRIBEFILE: A text file that contains explicit descriptions. The DESCRIBEFILE,
if specified, is an "own directory" specific file -- a separate one
should be specified for each directory (and subdirectory) being
indexed. If an entry for an indexed file is found in the "own
directory" DESCRIBEFILE, then the associated description is used
(a description will NOT be constructed from the contents of the file).
DESCRIBEFILE files should look like:
filename.ext a description
filenam2.ext another description
filenam3.ext another description, this one on
| two lines
Example: DESCRIBEFILE=DESCRIBE.TXT
** Other options.
WATCH: If WATCH=1, and you are running GOSWISH as an SRE-http addon,
then status information will be shown.
Note: when run as a CGI-BIN script, WATCH is ignored (status
is not "watched"). However, GOSWISH will "START" SWISH --
a descriptive name will show up in the task list.
When run as an SRE-http addon, programs are DETACHed (and will
not show in the task list).
DOSTEM: If DOSTEM=1, then apply a "stemming" algorithm when indexing.
This algorithm will strip suffixes (such as "s", "ed", etc.)
from words before indexing them.
INDEXCOMMENTS: If INDEXCOMMENTS=0, then SWISH will index HTML comments.
If INDEXCOMMENTS=1, comments will not be indexed.
** Example.
Note that this would typically be a single long request,
or would be the body of a POST request (also note use of URL encoding):
/CGI-BIN/GOSWISH?
mode=M&
sel=/SAMPLES&
swifile=&
searchdoc=&
repwith=&
extlist=.htm+.txt+.gif+.jpg+.doc+.sht+.html+.shtml&
extlist_nofollow=+.gif+.xbm+.jpg+&
fr_pathname=contains+admin+testing+demo+trash+construction+PRIVATE+private+confidential+&
fr_directory=contains+.htaccess+&
fr_filename=contains+%23+%25+%7E+.bak+.orig+.old+old.+&
fr_title=contains+construction+example+pointers+&
ignorelimit=50+100&
ignorewords=SwishDefault&
indexname=&indexadmin=&indexdescription=&indexpointer=&
makesummary=2&
htmls=+HTM+HTML+SHTML+SHT+&
describefile=DESCRIBE.TXT&
watch=Y
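To see what the URL-encoded values in a request like the one above stand
for, the following Python sketch decodes a shortened copy of it. The
decode_request helper is a hypothetical name; it simply shows that each +
becomes a space (so most values are space delimited lists) and that
sequences such as %23, %25 and %7E stand for #, % and ~.

```python
from urllib.parse import parse_qsl

# A shortened copy of the example request (everything after the ?)
query = ("mode=M&sel=/SAMPLES&swifile=&extlist=.htm+.txt+.gif&"
         "fr_filename=contains+%23+%25+%7E+.bak&ignorelimit=50+100&makesummary=2")

def decode_request(query):
    """Return {option name: list of space delimited values}."""
    # keep_blank_values=True retains options sent with an empty value
    pairs = parse_qsl(query, keep_blank_values=True)
    return {name: value.split() for name, value in pairs}

options = decode_request(query)
print(options["fr_filename"])
print(options["ignorelimit"])
```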
--------------------------------
IV.c.GoSWISH MODE=S options
MODE=S is the "search a SWISH index" mode. The following parameters
can be fed to GoSWISH, either as part of an HTML form, or as
part of a (possibly quite long) URL.
INDEX:
A space delimited list of "SWISH" indices.
Each index in this list may be a fully qualified name, or
a relative name.
If a relative name is used, it is assumed to be relative to the SWISH_DIR.
DCT_FILE:
A space delimited list of description-cache files.
Each file in this list may be a fully qualified name, or
a relative name.
If a relative name is used, it is assumed to be relative to the SWISH_DIR.
EXIST
If EXIST=1, then GoSWISH will check to see if the matches are accessible.
When matches are URLS (which is the usual case), socket calls are used.
If the URL does not exist, the match will be displayed, but will not
be linked.
In general, use of EXIST=1 is NOT recommended -- it dramatically increases
response time.
CONDITION
Controls search logic. CONDITION should be set to OR or NOT.
Depending on the value of CONDITION, an "OR" or a "NOT" will
be placed between keywords.
If CONDITION is not included, "AND" is used.
* Keywords for which an explicit "logical control" was
included will not be affected by the CONDITION parameter.
That is, CONDITION only applies to keywords that do not have
a preceding AND, OR, or NOT.
* Caution: CONDITION does not work well with (phrases) or
in combination with complex user specified search strings.
Example:
Note: the list of "keywords" can also contain OR, NOT, AND and ( ) --
CONDITION will NOT override these explicit boolean terms.
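The way CONDITION interacts with explicit logical controls can be sketched
in Python as follows. This is an illustration of the rule stated above, not
the code used by GOSWISH.CMD; apply_condition is a hypothetical name, and
parentheses are deliberately ignored (as cautioned, CONDITION does not work
well with them).

```python
OPERATORS = {"AND", "OR", "NOT"}

def apply_condition(keywords, condition="AND"):
    """Insert `condition` between adjacent words lacking an explicit operator."""
    out = []
    for word in keywords.split():
        is_operator = word.upper() in OPERATORS
        # Only add the default when neither neighbour is AND, OR, or NOT
        if out and not is_operator and out[-1].upper() not in OPERATORS:
            out.append(condition)
        out.append(word)
    return " ".join(out)

# The explicit AND is kept; OR is only placed between "dog" and "fish"
print(apply_condition("cat AND dog fish", "OR"))
```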
COMMENT
Comments to place (using ) under header. You can include multiple
COMMENT elements.
Example:
FOOTER_FILE
A file to use as a footer.
* The FOOTER_FILE is assumed to be relative to the SWISH directory
(or to a virtual directory)
* Server side includes will NOT be attempted on the footer_file.
Example:
HEADER_FILE
A file to use as a header. If HEADER_FILE is specified, the
HEADER, H1, and H2 options are ignored (COMMENTS are NOT ignored).
* If you use a HEADER_FILE, you MUST include a statement in it.
* The HEADER_FILE is assumed to be relative to the SWISH_DIR directory --
either in it, or in a subdirectory of SWISH_DIR.
* Server side includes will NOT be attempted on the header_file.
Example:
HEADER, H1, and H2
A header to display (at the top of the results page).
A default header is used if neither HEADER_FILE nor HEADER is specified.
H1 will automatically prepend an <H1>, and append a </H1>, to
your header.
H2 will automatically prepend an <H2>, and append a </H2>, to
your header.
You should only use one of HEADER, H1, or H2
Example:
KEYWORD
A space delimited list of words to search for, with OR AND NOT used
as (optional) logical controls (AND is assumed).
If KEYWORD is not included, a keyword of HELP is used.
Example (from a FORM):
Another example, assuming you defined a METANAME of AUTHOR
enter author's name
(the user should enter the author's name after the
AUTHOR= in the input box)
OPTION
A search-modification option. You can include multiple OPTION elements.
Valid OPTIONs include:
-t HBthec : search in head, body, title, header, emphasized, or comments
-m nn : display a maximum of nn matches
For a description of these options, see the SWISH documentation at:
http://www.eit.com/software/swish/swish.html
or run SWISH (from an OS/2 command prompt) for a short synopsis.
Examples (display 10 matches, searching HTML documents only):
SPECIAL OPTIONS (requires a DCT_FILE)
Option="FILE"
Instead of looking for keywords, you can search for "URL names"
that contain a matching substring.
Option="SUMMARY"
Searches the "automatic description" for matching substrings.
* The URL name will be displayed as the link (instead of
the TITLE). However, the descriptions will be the same as
those used in regular keyword searches through the SWISH index.
* When file or summary search is selected, the HBthec options are
ignored.
Example:
SEARCH_LINK:
Specify the target for a "New Search?" link (which will be displayed
at the bottom of results pages). This should be a valid URL pointing
to your search form.
Example:
SHOWPROP:
Specify "properties" to display (if available).
You can specify multiple occurrences of SHOWPROP -- a cumulative
list is built.
In addition to property names, you can include a special "property"
of _SUMMARY_n (n=0, 1, or 2). These are synonymous with specifying
a SUMMARY=n option.
Examples:
(note that BOTH of these could be specified simultaneously --
which would mean "display both the Description and the Author
properties for each matched file").
Warning: in order to use SHOWPROP, you MUST have used an
appropriate PropertyNames option when you created
the SWISH configuration file. For example:
PropertyNames description Author
START:
Display the first m matches, starting with the START match.
By default, START=1.
The most frequent use of START is to tell GoSWISH to "make links
to the next 20 matches". To do this, you should use a special
form of the START option:
START=1 0
.... that is, a 1, a space, then a 0 (or, if used in a URL: START=1+0)
GoSWISH will interpret this to mean
* start at the #1 match and display the selected quantity of matches
* if there are undisplayed matches, provide links to the
next (or prior) set of matches
Please note that an inefficient algorithm is used: GoSWISH will re-search
the SWISH index, and selectively display the appropriate matches (say, matches
21 to 40 if you specified "show 20 matches").
Nevertheless, this option does give you the ability to display lots of
matches "a page at a time".
Examples:
START=20
START=1+0 (where the + is a URL-encoded space)
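The "page at a time" behaviour described above can be sketched in Python.
This is only an illustration of the arithmetic involved (which matches are
shown, and where the next and prior links would start); page_of_matches is
a hypothetical name, and the list of matches stands in for the results of
re-searching the SWISH index.

```python
def page_of_matches(matches, start=1, per_page=20):
    """Return (shown, prior_start, next_start) for a 1-based START value."""
    start = max(start, 1)
    first = start - 1                      # convert to a 0-based list index
    shown = matches[first:first + per_page]
    # A "prior" link only makes sense when something precedes this page
    prior_start = max(start - per_page, 1) if first > 0 else None
    # A "next" link only makes sense when undisplayed matches remain
    next_start = start + per_page if first + per_page < len(matches) else None
    return shown, prior_start, next_start

matches = list(range(1, 46))               # pretend the search found 45 matches
shown, prior_start, next_start = page_of_matches(matches, start=21)
print(shown[0], shown[-1], prior_start, next_start)
```

With 45 matches and 20 per page, START=21 shows matches 21 to 40, with a
prior link starting at 1 and a next link starting at 41.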
Note: the "m" (in "first m matches") is specified by using
an OPTION. For example, to specify "display 10 matches
at a time, starting from the first match":
SUMMARY:
Display a summary.
summary=0 : do not display summaries
summary=1 : display summaries.
summary=2 : display summaries; if no summary
is available, try to create one by reading
the file, or the URL of the file (depending
on what appears in the G.
Notes:
* If summary=1, then a DCT_FILE must be specified.
* If summary=2, then a DCT_FILE should be specified,
but need not be. However, we do NOT recommend using SUMMARY=2,
since creating summaries "on the fly" can bog down your server.
* See SHOWPROP for another way of requesting display of summaries.
* Summary display will indicate the "source" of the summary:
>> if from a "DESCRIBEFILE", or from a element, then
the standard font is used
>> if generated from a non-html text file, or from the
of an HTML file, font is used.
** Example (note use of URL encoding):
/CGI-BIN/GOSWISH?
MODE=S&
INDEX=index32.swi&
DCT_FILE=index32.dct&
KEYWORD=daniel&
COND=AND&
OPTION=-m+20&
SUMMARY=1&
HEADER=Search+of+my+files
* Miscellaneous comments
* Please remember that the SWISH "index" of your directory is a static
document, and will not reflect subsequent changes in the contents of
your site (this is also true of the "description file").
So, if you make substantial changes in site content, you should
rerun GoSWISH.
* If you do NOT need the "keyword search" features (that is, you only want
to search filenames or summaries), you can skip the use of SWISH. This
does require providing MKDCT with a list-of-URLs (see MKDCT for
details).
* The "search documents" produced by GoSWISH can be easily modified.
In particular, you can add HEADER_FILE, FOOTER_FILE, and COMMENT options.
* GoSWISH will auto-detect whether the target SWISH index is version 1.1
or version 1.3, and will run the appropriate version of SWISH (assuming
that SWISH 1.1 is named SWISH.EXE and ver 1.3 is SWISH-E.EXE).
* When specifying multiple index files, you can NOT mix version 1.1 and
version 1.3 swish indices.
--------------------------------
V. The MKDCT program
MKDCT.CMD is a standalone program used to create a "description-cache"
(DCT) file. The output of MKDCT differs from the description file that
can be (optionally) produced by GOSWISH.CMD in several ways:
a) You can create either "regular" or "structured" DCT files (see
below for a description of these two forms of DCT files).
b) You can run MKDCT at any time. In contrast, GOSWISH only produces its
description file while producing the SWISH index.
c) MKDCT has a few extra options.
d) MKDCT contains a simple "description-cache file" editor.
MKDCT has two file selection modes: a "SWISH" mode and a "List of URL's" mode.
* SWISH mode: The SWISH mode uses the ".CON" file you used to
create the SWISH index; and the ".SWI" SWISH index file.
* List-of-URLS mode: The List-of-URLs mode requires a text file containing
"URLS" to be examined (see MKDCT.IN for a simple example).
Entries in these files should have the following form:
URL " short description" byte_size fully_qualified_filename
where the URL should be a valid "link" to your site, and the
last three terms are optional.
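A line of a list-of-URLs file can be picked apart as sketched below. This
Python fragment is not taken from MKDCT.CMD; the regular expression and the
parse_url_entry name are assumptions that follow the entry form given
above, and the sample entry is made up for illustration.

```python
import re

# URL, then an optional "description", byte size, and fully qualified file name
ENTRY = re.compile(r'^\s*(\S+)(?:\s+"([^"]*)")?(?:\s+(\d+))?(?:\s+(\S.*?))?\s*$')

def parse_url_entry(line):
    """Split one list-of-URLs line into its four terms (None when absent)."""
    m = ENTRY.match(line)
    if m is None:
        return None                      # blank line: nothing to parse
    url, description, size, filename = m.groups()
    return {
        "url": url,
        "description": description.strip() if description is not None else None,
        "byte_size": int(size) if size else None,
        "filename": filename,
    }

# Hypothetical entry; only the URL is required
entry = parse_url_entry('/samples/a.htm " A sample page" 1234 E:\\WWW\\SAMPLES\\A.HTM')
print(entry)
```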
In general, if you've gone to the trouble of obtaining and using SWISH,
it's probably easier to use the SWISH mode.
When it comes to creating the descriptions, MKDCT will either generate
descriptions for HTML and plain-text files by examining the
contents of the file, or it will examine "directory-specific" description
files -- text files that may be in each of the (several) directories
being indexed. Each of these (possibly several) files should contain
descriptions about the files in its own directory.
The basic structure of these "directory-specific" description files is
simple. It should contain entries that look like:
filename.ext a description
filenam2.ext another description
filenam3.ext another description, this one on
| two lines (this is the continuation of the filenam3.ext description)
These descriptions can be of any length -- just be sure to start the
"continuation lines" with a | character. Furthermore, the files can be
of any mime type -- they need not be "HTML" or "plain text" files.
Lastly, you should NOT include path information on the filename.ext
portion of an entry -- a given "directory-specific" description file
ONLY refers to files in "its own directory".
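The layout of a "directory-specific" description file can be illustrated
with a short Python sketch. It is not taken from MKDCT.CMD. The function name
parse_descriptions is hypothetical, and the sketch assumes that the filename
is the first whitespace-delimited word on a line (so filenames containing
spaces are not handled) and that a continuation line may be indented before
its | character.

```python
def parse_descriptions(text, continuation_flag="|"):
    """Return a dict mapping filename -> description.

    Lines that begin with the continuation flag are appended to the
    description of the most recent filename.
    """
    descriptions = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(continuation_flag):
            if current is not None:
                extra = line[len(continuation_flag):].strip()
                descriptions[current] = (descriptions[current] + " " + extra).strip()
            continue
        parts = line.split(None, 1)  # filename, then the rest of the line
        current = parts[0]
        descriptions[current] = parts[1].strip() if len(parts) > 1 else ""
    return descriptions
```

The continuation_flag argument mirrors the user-changeable "continuation
flag" parameter mentioned in the notes below.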
Structured vs. Regular DCT Files
"Regular" DCT files are the same as DCT files produced by GoSWISH.
"Structured" DCT files (which can be read, but not generated, by
GoSWISH) contain the same information, but use structured records
to speed up data retrieval. While not important for an index of a small
(say, fewer than a few hundred) set of files, for large (several thousand)
indices, extraction of summaries from structured DCT files can be
several times faster.
Other than the need to use MkDCT, there is one disadvantage to
structured DCT files -- they can NOT be combined. Regular DCT
files can be combined, say by using an HTML form statement of:
Notes:
* As with SWISH's index, the description file is permanent (at
least until you delete or replace it). Thus, changes to the contents
of your files will not affect the descriptions (nor will such changes
affect the SWISH index).
* MKDCT will ask you to supply a fully qualified name for the
description file.
* At the top of MKDCT.CMD are a number of user changeable parameters.
For example, you can modify the value of the | "continuation flag".
Of more general use, if you intend to use MKDCT frequently (for
example, if your site is changing rapidly), you may want to change
some of the default file names.
* Hint: Creating a DCT file for a large set of files can monopolize your
CPU for several minutes. If you do not want to bog your machine down,
and are willing to accept a longer completion time, you can
instruct MKDCT to "run at a lower priority".
--------------------
Appendix A) Hints on using SWISH
The following offers a brief description of how to run SWISH as
a standalone program. We do not necessarily advocate running SWISH
directly (i.e.; rather than running it via GoSWISH).... it's a matter
of taste.
Serious users should obtain and read the SWISH documentation,
which can be found at http://sunsite.berkeley.edu/SWISH-E.
It's actually fairly well written!
However, for those who aren't real ambitious, the following will probably
be all they really need to know to use SWISH effectively! Note that
this example does NOT use features unique to SWISH 1.3 (it will work
with both SWISH 1.1 and SWISH 1.3).
First, as mentioned in the installation section above, the following sample
files are included:
SAMPLE.CON : A "configuration" file used by SWISH
SAMPLE.SWI : The results of using SAMPLES.CON, ready to be used
as an "INDEX" file.
SRCHSAMP.HTM : An HTML document that uses GOSWISH to search
sample.swi.
Since SAMPLES.CON is tersely documented, let's discuss some of its more
important variables.
IndexDir
A space delimited list of "directories" to search (note that
subdirectories of these directories are also searched). These should
be fully qualified directories (though you don't need the drive
letter).
Note that in SAMPLES.CON, two directories are indexed:
SAMPLES and IMGS. Further note that we assume that
the GoServe data directory is \WWW.
IndexFile
The "index" file generated by SWISH, and used by the INDEX
option of GOSWISH. Since we try to be FAT friendly, we usually
give it a .SWI extension, but you can call it anything.
IndexOnly
Only files with these extensions will be indexed.
NoContents
Files with these extensions will only have their names
indexed (that is, their contents will not be examined).
This may not work all the time (SWISH has several such
bugs in it).
ReplaceRules
Replaces the portion of the file name with some other string.
THIS IS CRUCIAL -- if you don't get this one right,
the links created by GOSWISH will NOT work.
Note that ReplaceRules only seems to work on strings listed
in the IndexDir. Thus, you can't give different ReplaceRules
to different branches of a subdirectory tree (unless each
branch was explicitly mentioned in the IndexDir option).
You can use any string in the "convert to" portion of your
ReplaceRules. However, note the following:
FileRules
Used to suppress reporting certain directories.
Caution: FileRules seem to be a bit flaky, you may want to
experiment with them.
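Since ReplaceRules is the variable that most often goes wrong, it can help
to check on paper what a rule will do to a file name before running SWISH.
The Python sketch below is a simplified model, not SWISH's actual code. It
assumes that a rule replaces a leading portion of the file name (one of the
strings listed in IndexDir) with the "convert to" string. The function name
apply_replace_rules and the paths in the usage note are hypothetical.

```python
def apply_replace_rules(filename, rules):
    """Apply the first matching (find, convert_to) rule to a file name.

    ASSUMPTION: each rule replaces a leading portion of the file name,
    as suggested by the note that ReplaceRules only seems to work on
    strings listed in IndexDir.
    """
    for find, convert_to in rules:
        if filename.startswith(find):
            return convert_to + filename[len(find):]
    return filename  # no rule matched; name is left unchanged
```

For example, with a (hypothetical) rule that converts "/www/samples" to
"/samples", the file "/www/samples/index.htm" becomes "/samples/index.htm".
If the result is not a link that works from a browser, the links created by
GOSWISH will not work either.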
Once you've created your .CON file, you can run it through SWISH
(using the -c option) to build the index. Then, to search that index,
run SWISH again using the options:
-f swish_index_file_name -w word1 word2
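The two steps above (index with -c, then search with -f and -w) can be
sketched as a pair of Python helpers that build the command lines. This is
only an illustration. The helper names are hypothetical, the executable
name defaults to SWISH.EXE (use SWISH-E.EXE for ver 1.3), and nothing is
executed; the lists could be handed to subprocess.run if desired.

```python
def build_index_command(config_file, exe="SWISH.EXE"):
    """Command line that builds an index from a .CON configuration file."""
    return [exe, "-c", config_file]


def build_search_command(index_file, words, exe="SWISH.EXE"):
    """Command line that searches an index file for one or more words."""
    if not words:
        raise ValueError("at least one search word is required")
    return [exe, "-f", index_file, "-w", *words]
```

For instance, build_search_command("SAMPLE.SWI", ["word1", "word2"]) gives
the arguments SWISH.EXE -f SAMPLE.SWI -w word1 word2.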
---------------------
Appendix B. Acknowledgements and Legal Stuff
The original creator of SWISH (in 1994) was Kevin Hughes (then of
EIT). Custody of the rights has since passed to UC Berkeley, which
distributes new versions of SWISH as GNU style freeware
(see http://www.fsf.org/copyleft/gpl.html for the generic
GNU license).
The current (February 1999) home page for the SWISH project is:
http://sunsite.berkeley.edu/SWISH-E/
If you are interested in the complete SWISH for OS/2 package, you can
look for it on:
http://sunsite.berkeley.edu/SWISH-E/Ports/OS2/swishe131.zip,
hobbes.nmsu.edu (search for SWISH)
or the SRE-http home page
http://www.srehttp.org/pubfiles/swish11.zip (swish 1.1), or
http://www.srehttp.org/pubfiles/swish13.zip (swish 1.3)
GoSWISH was developed by Daniel Hellerstein (danielh at crosslink dot net).
It too is freeware, with the following GNU-like disclaimer:
Copyright 1998,1999,2001 by Daniel Hellerstein.
Permission to use the GoSWISH software package for any purpose is hereby
granted without fee, provided that the author's name not be used in
advertising or publicity pertaining to distribution of the software
without specific written prior permission.
With some provisos, this includes the right to subset and reuse the code,
with proper attribution.
Furthermore you may also charge a reasonable re-distribution fee for
GoSWISH; with the understanding that this does not remove the
work from the public domain and that the above provisos remain in effect.
Note that this disclaimer is only in regard to the various files
comprising GoSWISH, which does NOT include the SWISH executable(s) --
see the GNU license for information on distribution of SWISH.
Many kudos to Christopher McRae (christopher.mcrae at mq dot edu dot au) who
ported the version 1.3 C source code (from Berkeley) to OS/2, and
who generously created rxSWISH.DLL.
Also thanks to Stewart Buckingham (stu at mailroom dot com) who bravely
stepped up to the beta testing plate.