xml.pipeline
Class LinkFilter

java.lang.Object
  |
  +--xml.pipeline.EventFilter
        |
        +--xml.pipeline.LinkFilter
All Implemented Interfaces:
ContentHandler, DeclHandler, DTDHandler, EventConsumer, LexicalHandler

public class LinkFilter
extends EventFilter

Pipeline filter to remember (X)HTML links found in an (X)HTML document, so they can later be crawled. Fragments are not counted, and duplicates are ignored. Callers are responsible for filtering out URLs they aren't interested in. Events are passed through unmodified.

Input MUST include a setDocumentLocator() call, as it's used to resolve relative links in the absence of a "base" element. Input MUST also include namespace identifiers, since it is the XHTML namespace identifier which is used to identify the relevant elements.
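
The two input requirements above can be illustrated with a minimal sketch built only on the JDK's SAX classes. It is not the LinkFilter source: the class name LinkCollectorSketch and its collect() helper are hypothetical, it assumes that only the href attribute of XHTML "a" elements matters, and it leaves out "base" element and fragment handling.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for illustration; not the actual LinkFilter code.
class LinkCollectorSketch extends DefaultHandler {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // A set drops duplicate links; LinkedHashSet keeps first-seen order.
    private final Set<String> links = new LinkedHashSet<>();
    private Locator locator;

    @Override
    public void setDocumentLocator(Locator l) {
        // The locator's system ID serves as the base URI for relative links.
        locator = l;
    }

    @Override
    public void startElement(String namespace, String local, String name, Attributes attrs) {
        // Elements are recognized by namespace URI plus local name, never by prefix.
        // Assumption: only <a href="..."> is examined in this sketch.
        if (!XHTML_NS.equals(namespace) || !"a".equals(local)) {
            return;
        }
        String href = attrs.getValue("", "href");
        if (href == null) {
            return;
        }
        String base = (locator == null) ? null : locator.getSystemId();
        // URI.resolve throws IllegalArgumentException on malformed input;
        // a real crawler would want to catch that for dirty documents.
        links.add(base == null ? href : URI.create(base).resolve(href).toString());
    }

    List<String> getLinks() {
        return new ArrayList<>(links);
    }

    // Parses a document held in a string, using systemId as its base URI.
    static List<String> collect(String xhtml, String systemId) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // required: matching relies on namespace URIs
            LinkCollectorSketch handler = new LinkCollectorSketch();
            InputSource in = new InputSource(new StringReader(xhtml));
            in.setSystemId(systemId);
            factory.newSAXParser().parse(in, handler);
            return handler.getLinks();
        } catch (Exception e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }
}
```

With a parser that is not namespace-aware, the namespace argument arrives as an empty string and nothing matches, which is why the namespace requirement is a MUST. Without a locator, relative links in this sketch would be stored unresolved.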

This works reasonably well with "dirty" XHTML events (not necessarily very close to valid), such as those generated by the HtmlParser that wraps the Swing HTML 3.2 parser, because it uses only the elements and their attributes, not any structural information.

In the future this will be a natural place to use the new xml:base attribute ... in association with a stack of such. Similarly, this should later recognize (convert?) XLink data.

Version:
$Date: 2000/07/15 00:56:58 $
Author:
David Brownell

Fields inherited from class xml.pipeline.EventFilter
HANDLER_URI
 
Constructor Summary
LinkFilter()
          Constructs a new event filter, which collects links in a private data structure for later enumeration.
LinkFilter(EventConsumer next)
          Constructs a new event filter, which collects links in a private data structure for later enumeration and passes all events, unmodified, to the next consumer.
 
Method Summary
 void endDocument()
          Forgets about any base URI information that may be recorded.
 java.util.Enumeration getLinks()
          Returns an enumeration of the links found since the filter was constructed, or since removeAllLinks() was called.
 void removeAllLinks()
          Removes records about all links reported to the event stream, as if the filter were newly created.
 void setDocumentLocator(Locator l)
          Used to resolve relative links, as the base URI for the document.
 void startElement(java.lang.String namespace, java.lang.String local, java.lang.String name, Attributes attrs)
          Collects URIs for (X)HTML content from elements which hold them.
 
Methods inherited from class xml.pipeline.EventFilter
attributeDecl, characters, comment, elementDecl, endCDATA, endDTD, endElement, endEntity, endPrefixMapping, externalEntityDecl, getContentHandler, getDTDHandler, getErrorHandler, getNext, getProperty, ignorableWhitespace, internalEntityDecl, notationDecl, processingInstruction, setContentHandler, setDTDHandler, setErrorHandler, setProperty, skippedEntity, startCDATA, startDocument, startDTD, startEntity, startPrefixMapping, unparsedEntityDecl
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LinkFilter

public LinkFilter()
Constructs a new event filter, which collects links in a private data structure for later enumeration.

LinkFilter

public LinkFilter(EventConsumer next)
Constructs a new event filter, which collects links in a private data structure for later enumeration and passes all events, unmodified, to the next consumer.
Method Detail

getLinks

public java.util.Enumeration getLinks()
Returns an enumeration of the links found since the filter was constructed, or since removeAllLinks() was called.
Returns:
enumeration of strings.

removeAllLinks

public void removeAllLinks()
Removes records about all links reported to the event stream, as if the filter were newly created.

setDocumentLocator

public void setDocumentLocator(Locator l)
Used to resolve relative links, as the base URI for the document.
Overrides:
setDocumentLocator in class EventFilter
Following copied from interface: org.xml.sax.ContentHandler
Parameters:
locator - An object that can return the location of any SAX document event.
See Also:
Locator
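
As a rough illustration of what resolving a relative link against the document's base URI involves, the following sketch uses java.net.URI. The class and method names are hypothetical, and stripping the fragment is only one reading of "fragments are not counted" in the class description; the filter's own resolution code may differ.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of xml.pipeline.
class LinkResolutionSketch {
    // Resolves href against the base URI (for example, the locator's
    // system ID) and removes any "#fragment" suffix from the result.
    static String resolve(String base, String href) {
        String resolved = URI.create(base).resolve(href).toString();
        int hash = resolved.indexOf('#');
        return (hash < 0) ? resolved : resolved.substring(0, hash);
    }
}
```

Under these assumptions, with the base http://example.org/docs/index.html, the reference page.html#sec2 resolves to http://example.org/docs/page.html, and ../img/logo.png resolves to http://example.org/img/logo.png. A fragment-only reference such as #top collapses to the base document itself.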

startElement

public void startElement(java.lang.String namespace,
                         java.lang.String local,
                         java.lang.String name,
                         Attributes attrs)
                  throws SAXException
Collects URIs for (X)HTML content from elements which hold them.
Overrides:
startElement in class EventFilter
Following copied from interface: org.xml.sax.ContentHandler
Parameters:
uri - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
localName - The local name (without prefix), or the empty string if Namespace processing is not being performed.
qName - The qualified name (with prefix), or the empty string if qualified names are not available.
atts - The attributes attached to the element. If there are no attributes, it shall be an empty Attributes object.
Throws:
SAXException - Any SAX exception, possibly wrapping another exception.
See Also:
ContentHandler.endElement(java.lang.String, java.lang.String, java.lang.String), Attributes

endDocument

public void endDocument()
                 throws SAXException
Forgets about any base URI information that may be recorded. Applications will often want to call removeAllLinks(), likely after examining the links which were reported.
Overrides:
endDocument in class EventFilter
Following copied from interface: org.xml.sax.ContentHandler
Throws:
SAXException - Any SAX exception, possibly wrapping another exception.
See Also:
ContentHandler.startDocument()

Source code is GPL'd at http://xmlconf.sourceforge.net.