15 May 1998: The CheckLink addon for SRE-http, Version 1.02

Contact: Daniel Hellerstein (danielh@econ.ag.gov)

        CheckLink: Create, display, traverse, and index a web-tree

Abstract:
   CheckLink is a multi-threaded, socket aware utility used to create,
   verify, traverse, and index a web-tree; where "web-tree" is defined as
   all URLs (in-line images, anchors, etc.) that are referenced in a root
   HTML document, and in all documents reachable from this root.
   CheckLink can be run as an SRE-http addon, as a CGI-bin script, or
   from an OS/2 command prompt.

                        -------------------

Contents:
   I. Introduction
      I.a. Web Tree? Does that make sense?
   II. Installation
      II.a. Using CheckLink without SRE-http.
   III. CheckLink parameters.
      III.a. A Note on How CHEKLINK displays results
      III.b. CHEKLINK, CHEKLNK2, and CHEKINDX parameters
   IV. CheckLink Request Options
      IV.a. CHEKLINK options.
      IV.b. CHEKLNK2 options
      IV.c. CHEKINDX
         IV.c.ii. CHEKINDX options.
         IV.c.iii. CHEKINDX edit mode
   V. Notes
   VI. Disclaimer

                        -------------------

I. Introduction

CheckLink is a robot that is used to create, verify, traverse and index a
web-tree. In other words, CheckLink will find and variously display all
the URLs (such as anchors and in-line images) that appear in a set of
HTML documents.

In particular, CheckLink will:

 ... given a "starter-URL" provided by a client:
    a) use TCP/IP socket calls to obtain the contents of the HTML document
       (that this "starter-URL" points to)
    b) find URLs referred to by this document (i.e.

       cd \internet\cheklink
       D:\INTERNET\CHEKLINK>cheklink

When run in standalone mode, the i/o interface is primitive, and the
final output is HTML code -- it is meant to be viewed with a browser.
Otherwise, the results are the same as when run as an SRE-http addon
(it might even be a touch faster).

IMPORTANT NOTE: To use CheckLink as a standalone program, you MUST have
REXXLIB.DLL. REXXLIB is $25 shareware (obtainable from
http://www.quercus-sys.com/rexxlib.htm).
It's a good bargain, but if this expenditure is problematic, please
contact danielh@econ.ag.gov for alternatives.

If you want to use CheckLink to "examine and traverse the web-tree", or
to "create an index of the web-tree", you should copy the CHEKLNK2.CMD
and CHEKINDX.CMD files to your CGI-BIN scripts directory. The output
from CHEKLINK will contain CGI-BIN calls to CHEKLNK2.

Thus ... To use CheckLink in a non-SRE-http environment, you will
   a) Run CHEKLINK.CMD, from an OS/2 command prompt, to generate the
      index of a web-tree, and to produce several tables of results.
   b) Run CHEKLNK2.CMD and CHEKINDX.CMD as CGI-BIN scripts
   c) Optionally, make a few small CGI-BIN modifications to CHEKLINK.HTM
      (see CHEKLINK.HTM for the details)

                        -------------------

III. CheckLink parameters.

Regardless of how you run CheckLink, you may wish to first adjust
several performance-tuning and display-customization parameters. Most of
these appear at the top of CHEKLINK.CMD, and there are a few in
CHEKLNK2.CMD and CHEKINDX.CMD -- you should modify these files with your
favorite text editor.

Note that to use any of the 3 CheckLink programs you do NOT need to set
these parameters -- the default values work reasonably well. However, if
you intend to make more than occasional use of CheckLink, we recommend
setting the LINKFILE_DIR parameter in CHEKLINK.CMD, CHEKLNK2.CMD, and
CHEKINDX.CMD.

                        -------------------

III.a. A Note on How CHEKLINK displays results

Before further discussion, a note on how CHEKLINK displays results (when
run as an SRE-http addon) is germane:

   CHEKLINK can return results in one long document, as a "two part"
   document, or in two separate documents.

In a "two part" document:
   The first part contains status information, and is sent to the client
   in pieces.
   The second part contains the results tables.
In a "long document" these parts are concatenated -- the final output
contains both "status" and "results" information (and will be a bit more
cluttered as a result).

Since CHEKLINK can take several minutes to process a thousand or so
links, the production of "status" information is crucial. In fact, this
status information is "sent in pieces" -- with some sort of output being
sent to the client every few seconds. Not only does this help keep the
client from giving up, it also prevents "server inactive" timeouts.

In fact, it's this "may take several minutes to finish" aspect of
CHEKLINK that makes it very difficult to distribute a pure CGI-BIN
version of CHEKLINK -- most CGI-BIN implementations do NOT allow for
"sending results as they become available", and one cannot count on
lengthy (i.e.; more than a few minutes) inactive-timeouts.

Although two-part documents are the more elegant solution, with certain
browsers some very annoying "over refresh" behavior occurs (i.e.; every
time you "back up" to the results, CHEKLINK is reinvoked). As a
workaround, the "two document" strategy can be used, which will result in
almost the same display as a two-part document (client pull is used to
automatically replace the "status" document with the "results" document).
The drawback is the requirement for semi-permanent storage of the
results file on your server's disk -- you may need to monitor disk space
if you allow CHEKLINK to be extensively used in two-document mode.

                        -------------------

III.b. CHEKLINK, CHEKLNK2, and CHEKINDX parameters:

BACK_1 : <BODY> modifiers.
BACK_2
     BACK_1 and BACK_2 are used to set a BGCOLOR (or BACKGROUND) for the
     "two parts" of CheckLink's output.
     Note that if you are using CheckLink in single-part mode (i.e.; if
     you are using an older web browser, or if you set the USE_MULTI
     option to 0) BACK_2 is ignored.
     Examples:
           back_1='bgcolor="#668a78"'
           back_2='bgcolor="#8888dd" background="CL.GIF"'
     Note: BACK_1 (BACK_2) is ignored if USER_INTRO1A (USER_INTRO1B) is
           set to a non-null value.

CHEKLINK_HTM : URL pointing to CHEKLINK.HTM
     CHEKLINK_HTM should contain a URL (usually, a relative URL) that
     points to the CHEKLINK.HTM file shipped with CheckLink.
     This variable is only used to add a "generate another web-tree"
     option to the output file. Thus, neglecting to properly set
     CHEKLINK_HTM will have few deleterious effects.
     Example: CHEKLINK_HTM = '/CHEKLINK.HTM'

CHECK_ROBOT : Enable (or suppress) checking of ROBOTS.TXT.
     If check_robot=1, then check the starter-URL site for a /robots.txt
     file, and use it to control the extent of the search. Set
     check_robot=0 to suppress this check.
     Proper netiquette dictates that when checking a stranger's site, you
     make sure you have set check_robot=1.
     Note: the contents of a ROBOTS.TXT file are added to a special
           "site-specific" EXCLUSION_LIST -- it only affects URLs on the
           starter-URL site.
     Example: check_robot=1

DOUBLE_CHECK:
     Since servers can be momentarily busy, it's often wise to "double
     check" busy servers.
        To do this, set DOUBLE_CHECK=1
        To NOT double check, set DOUBLE_CHECK=0.
     This double checking will only look at servers that were "not
     available". It will be done after all links have been examined (thus
     giving the "not available" server a chance to become available).
     Lastly, GET queries are used (instead of HEAD queries).

GET_QUERY:
     As part of mapping a web-tree, CheckLink will query servers for
     basic information on URLs. These queries are best done with HEAD
     requests. Unfortunately, there are a number of older servers that do
     not properly respond to HEAD requests. If you find that CheckLink is
     identifying many URLs as unavailable (even though your browser can
     get to them readily), it may be due to their host server's failure
     to recognize these HEAD requests.
     As a workaround, you can use short GET requests instead of HEAD
     requests. This method is engaged by setting: get_query=1.
     Example: get_query=0
     Note: This get_query=1 method is not highly recommended -- it's
           slower, and somewhat "ruder" (connections are purposely
           broken, which tends to add garbage to the visited server's log
           file). Instead, we recommend setting DOUBLE_CHECK=1

LINKFILE_DIR: directory to store "linkage" files in.
     Linkage files contain "link" information on all the URLs discovered
     during CheckLink's recursive mapping of a "web tree". In particular,
     the LINKFILE option (see section IV) specifies a filename, which
     will then be stored in the LINKFILE_DIR.
     By default, LINKFILE_DIR will be your OS/2 TEMP drive.
     Example: LINKFILE_DIR='D:\GOSERVE\CHKLNKS'
     Note: in addition to storing LINKFILEs, the LINKFILE_DIR is also
           used to store "RESULTS" files.

MAXATONCE: maximum number of "query" threads
     Specifies the maximum number of threads to use when checking for the
     existence (and mimetype) of a link (using HEAD requests).
     Increasing this number may speed up throughput, but it may subject
     the target server(s) to excessive loads.
     Example: maxatonce=6

MAXATONCE_GET: maximum number of "read" threads.
     Specifies the maximum number of threads to use when retrieving the
     contents of a URL (using GET requests).
     Increasing this number may speed up throughput, but it may subject
     the target server(s) to excessive loads.
     Example: maxatonce_get=2

MAXAGE: Kill a query if it's old
     Specifies the number of seconds to wait on a query (a HEAD request).
     You may need to increase this time span if sites are far away or
     otherwise slow. However, increasing MAXAGE will increase the time
     that CheckLink waits on "hung" sites.
     Example: maxage=30

MAXAGE2: Kill a read if it's old
     Specifies the number of seconds to wait on a read (a GET request).
     You may need to increase this time span if sites are far away or
     otherwise slow. However, increasing MAXAGE2 will increase the time
     that CheckLink waits on "hung" sites.
     Example: maxage2=60

ROW_COLOR1  : Used to set the row colors in the results tables
ROW_COLOR2
ROW_COLOR1A
ROW_COLOR2A
     ROW_COLOR1 and ROW_COLOR2 set the odd and even rows (respectively)
     of tables used to display the results of checking IMG links.
     ROW_COLOR1A and ROW_COLOR2A set the odd and even rows (respectively)
     of tables used to display the results of checking Anchor links.
     Examples:
           row_color1='bgcolor="#bbcc66"'
           row_color2='bgcolor="#aaccdd"'
           row_color1a='bgcolor="#bbaa44"'
           row_color2a='bgcolor="#aaccdd"'

TD_INDENT: Used to indent a Type=2 (table) index
     This string is used to indent each row of an "index table". You can
     try using characters (i.e.; ___ ), non-breaking spaces
     (i.e.; &nbsp;), or empty table columns.
     Example: td_indent=' __'

     Special Feature:
        If you have a 1_PIXEL.GIF in the /IMGS/ directory of your web
        tree, you can set td_indent equal to an integer value. This
        causes a spacer image (1_PIXEL.GIF) to be used, stretched to a
        size of (say) 45 pixels, where 45 is any number equal to
        td_indent * indent_levels.
        A 45 pixel spacer would be used if
           > td_indent=15,
           > a given line is being displayed at a third level indentation
        Since td_indent*3 == 15*3 == 45, a 45 pixel "blank" spacer-image
        will be drawn.
     Note: 1_PIXEL.GIF should be a GIF file consisting of 1 transparent
           pixel.

TD_TITLE: Modifies "title" field of a table index.
     TD_TITLE is used in the ....

...

     Note: use of user_intro1a (user_intro1b) means that back_1 (back_2)
           is NOT used.
     Examples:
           user_intro1a=''
           user_intro1b='D:\GOSERVE\CHEK1.HDR'

                        -------------------

IV. CheckLink Request Options

Request options are specified when one of the CheckLink programs is
requested; say, when you use CHEKLINK as the ACTION in an HTML FORM.
The following briefly describes these options. For further details, we
recommend perusing CHEKLINK.HTM.

                        -------------------

IV.a. CHEKLINK options.

The only required option is URL (defaults will be used for the other
options when they are not specified).

Options:

BASEONLY :
     BASEONLY=0 : Read URLs relative to the root of the request
     BASEONLY=1 : Read URLs relative to the base of the request
     Example: if URL=/dogs/foo.htm; then
           baseonly=0 : /cats/bar.htm would be "recursively" read
           baseonly=1 : /cats/bar.htm would NOT be "recursively" read

DESCRIP: Create & save descriptions for "on-site" (and "in directory",
         if BASEONLY=1) documents.
     DESCRIP=0 -- do not create descriptions
     DESCRIP=1 -- create descriptions for text/html documents
     DESCRIP=2 -- create descriptions for text/html and text/plain
                  documents
     DESCRIP=1 is fairly costless (it uses information that's already
     been read). DESCRIP=2 requires reading additional files.
     A maximum of 300 characters is retained (this can be modified by
     changing the DSCMAX parameter in CHEKLINK.CMD).

EXCLUSION_LIST: Space delimited list of selectors to NOT query or read.
     *'s can be used as wildcards.
     Example: '!* *?* *MAPIMAGE/* CGI-*'
              (this is also the default)

LINKFILE : Name of a file to store "linkage" information.
     Linkage information pertains to each and every URL in the web-tree.
     Each of these URLs will be associated with a list of the text/html
     URLs, residing in the web-tree, that contain links pointing to this
     URL. In addition, each text/html URL (in the web tree) is associated
     with a list of all its links (that point both on and off site).
     The LINKFILE is used to store these lists.
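
     As a rough illustration of the two kinds of lists described above,
     here is a minimal Python sketch. It is NOT the format of the .STM
     linkage file, and it is not how CHEKLINK is implemented (CHEKLINK is
     a REXX program); the function name build_linkage, the variable
     names, and the sample URLs are all hypothetical.

```python
from collections import defaultdict


def build_linkage(pages):
    """Build the two lookups that a linkage file conceptually holds.

    pages: dict mapping each text/html URL in the web-tree to the list
           of URLs (on-site or off-site) that it links to.
    Returns (links_from, linked_by):
      links_from[url] -> sorted list of URLs that url links to
      linked_by[url]  -> sorted list of text/html URLs that link to url
    """
    # Forward lists: every link found in each text/html document.
    links_from = {url: sorted(set(targets)) for url, targets in pages.items()}

    # Reverse lists: for every URL, the documents that point at it.
    reverse = defaultdict(set)
    for url, targets in links_from.items():
        for target in targets:
            reverse[target].add(url)
    linked_by = {target: sorted(sources) for target, sources in reverse.items()}
    return links_from, linked_by


# Hypothetical two-page web-tree.
pages = {
    "/index.htm": ["/dogs/foo.htm", "/imgs/logo.gif", "http://example.org/"],
    "/dogs/foo.htm": ["/index.htm", "/imgs/logo.gif"],
}
links_from, linked_by = build_linkage(pages)
print(linked_by["/imgs/logo.gif"])
```

     Keeping both directions is what makes it cheap to answer "what does
     this document link to" as well as "which documents link here".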
     More importantly, CHEKLNK2.CMD uses the LINKFILE to "examine and
     traverse" the web tree.
     Notes:
       * The LINKFILE should be a file name, without path or extension
         information. A default extension of .STM is used, and the file
         is written to the LINKFILE_DIR directory.
       * If you do not want to retain this information, set LINKFILE=0
       * If you set LINKFILE (to a non-0 value), the output from CHEKLINK
         will contain links (one for each URL) to CHEKLNK2.

NAME: A descriptive name
     You can enter a descriptive name for this "web-tree" -- it will be
     displayed at various points. If you do not specify a name, a default
     name will be constructed from the URL option (see below).
     Example: name=A+Sample+web_tree
              (note the URL encoding of spaces as + characters)

OUTTYPE: A space delimited list of tables to produce.
     The following values can be used in any combination:
        OK       ) Display successfully found links
        NOSITE   ) Display links to unreachable sites
        NOURL    ) Display links to missing resources
        OFFSITE  ) Display links to off-site URLs
        EXCLUDED ) Display links to excluded URLs (as specified in the
                   EXCLUSION_LIST)
        ALL      ) Display all links
     Examples:
           OUTTYPE='ALL'
           OUTTYPE='OK NOURL '

RESULTS : A file containing the results of a prior call to CheckLink
          (primarily for internal use by CheckLink).
     Due to inappropriate refreshing by certain browsers, CheckLink can
     be instructed to save its results tables to a file (see the
     description of USE_MULTI). RESULTS points to one of these files --
     when included, CheckLink will just return the RESULTS file.
     Example: results="CHKS0001.HTM"
     Note that these "results" files are stored in the LINKFILE_DIR
     directory.

SITEONLY:
     SITEONLY=0 : Query all URLs
     SITEONLY=1 : Query only URLs on the starter-URL's "own site"

URL: URL=fully qualified, or relative, URL
     This is the "starter-URL"
     Example: url="/samples/guide.htm"

USE_MULTI:
     USE_MULTI=0 : Return results in one long document
     USE_MULTI=1 : Return results in a two-part document; with the second
                   part replacing (overwriting) the first.
     USE_MULTI=2 : Return results in two separate documents, the second
                   one being stored on the server's disk.
     Note that if an older browser (that does not support
     connection:maintain) is used, then USE_MULTI is set to 2.

     The primary reason for USE_MULTI=2 is to work around the
     "over-refreshing" bugs of certain browsers.

     Note that when USE_MULTI=2 is used, the RESULTS option is used
     internally by CHEKLINK to provide a link to the second document.
     This document, which will be assigned a random name, will be stored
     in the LINKFILE_DIR directory.

                        -------------------

IV.b. CHEKLNK2 options

CHEKLNK2 is used to examine and traverse a web tree. Typically, you would
not code a request to CHEKLNK2 -- you'd use the links to CHEKLNK2 in the
table produced by CHEKLINK. In addition, CHEKLNK2 includes numerous links
back into CHEKLNK2, links that utilize the options listed below.

That is, CHEKLNK2 is somewhat of a self-contained program. It is NOT
expected that CHEKLNK2 will be explicitly used by most authors.
Therefore -- the following description will be rudimentary.

Note that CHEKLNK2 can be called as an SRE-http addon, or as a CGI-BIN
script (but not as a standalone program).

Options:

LINKFILE -- Same definition as above -- the linkage file (relative to the
     LINKFILE_DIR directory) that was created by a request to CHEKLINK.

ENTRYNUM -- pointer to an entry in the LINKFILE -- this entry corresponds
     to a unique URL; CHEKLNK2 will display links to and from this unique
     URL.
     Example: entrynum=12
     If entrynum=0, an alphabetized index of all text/html documents (in
     the web-tree) will be displayed.

ISIMG -- Select between image & anchor links.
     Setting isimg=1 means to use "image" links; otherwise, use "anchor"
     links. Note that the combination of ENTRYNUM and ISIMG dictates
     which URL will be examined.
     Example: entrynum=15&isimg=1

VIA -- Information on what location in the web-tree (which URL) was being
     examined prior to jumping here.

LIST -- Enable "traverse web tree mode".
     LIST can take the following values:
       LIST=0 (the default; used if LIST is not specified).
              Display a "synopsis" of the URL. This synopsis includes
              basic information (such as the size and mime type), and a
              list of URLs (in the web tree) that refer to text/html
              documents that contain links to this URL (the entrynum
              URL). In addition, if this (the entrynum) URL is a
              text/html document, a table of all links (images and
              anchors) will be displayed.
       LIST=1 Display an (alphabetized) list of links to all text/html
              documents pointed to by links in the "entrynum URL" (more
              precisely, by the text/html document pointed to by the
              entrynum URL).
       LIST=2 Similar to LIST=1, but display text/html documents that
              point TO the "entrynum URL" (LIST=2 is the reverse of
              LIST=1).
       LIST=3 Display an alphabetized table of ALL URLs contained in the
              web-tree. In contrast, using LIST=0 and ENTRYNUM=0 will
              generate a list of "on-site, text/html documents".
     Example: LIST=1&entrynum=5

MIME -- A space delimited list of mimetypes, possibly containing
     wildcards.
     MIME is only used when LIST=3. When you specify MIME, only URLs with
     a mimetype matching (one of) the elements of the MIME value will be
     used.
     Examples:
           LIST=3&MIME=text/plain
           LIST=3&MIME=image/*
           LIST=3&MIME=application/pdf+application/x-pdf
              (note use of + as a URL encoded space)

Special Note:
     If you include an * in the LINKFILE value, CHEKLNK2 will produce a
     short list of currently available linkage files, and let you choose
     one to examine. The choice uses normal file matching rules.
     For example
           /CHEKLNK2?linkfile='CHK*'
     may yield CHK01, CHKNOW, and CHK_C.

                        -------------------

IV.c. CHEKINDX

CHEKINDX is used to create a hierarchical index of your web-tree. By
hierarchical index, we mean the sort of index we are all familiar with
-- a highly indented list, with more "subsidiary" resources on more
indented lines.

Basically, the notion is to use CHEKINDX to create a "web index" that
you can post on your site (usually with suitable prettifications).
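
To make the idea of a "highly indented list" concrete, here is a minimal
Python sketch. It only illustrates the general technique (a depth-first
walk that indents each document under the document that links to it); it
is not CHEKINDX's actual algorithm or output format. The function name
render_index, the sample tree, and the URLs are hypothetical.

```python
def render_index(tree, root, indent="  "):
    """Return an indented, depth-first listing of a web-tree.

    tree: dict mapping a document URL to the list of document URLs it
          links to (its "subsidiary" resources).
    root: the starter URL.
    Each URL is listed once, at the first place it is reached, so that
    links pointing back up the tree do not cause endless recursion.
    """
    lines = []
    seen = set()

    def visit(url, depth):
        if url in seen:
            return
        seen.add(url)
        lines.append(indent * depth + url)
        for child in tree.get(url, []):
            visit(child, depth + 1)

    visit(root, 0)
    return lines


# Hypothetical web-tree; note the link from foo.htm back to index.htm.
tree = {
    "/index.htm": ["/dogs/foo.htm", "/cats/bar.htm"],
    "/dogs/foo.htm": ["/dogs/breeds.htm", "/index.htm"],
}
index_lines = render_index(tree, "/index.htm")
print("\n".join(index_lines))
```

Listing each URL only at its first appearance is one simple way to turn
a web of cross-links into a tree; other policies (for example, listing a
document under every parent) are equally possible.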
Note that CHEKINDX uses nested "unordered lists" (