java.lang.Object
  |
  +--org.brownell.xml.pipeline.EventFilter
        |
        +--org.brownell.xml.pipeline.LinkFilter
Pipeline filter to remember (X)HTML links found in an (X)HTML document, so they can later be crawled. Fragments are not counted, and duplicates are ignored. Callers are responsible for filtering out URLs they aren't interested in. Events are passed through unmodified.
Input MUST include a setDocumentLocator() call, as it's used to resolve relative links in the absence of a "base" element. Input MUST also include namespace identifiers, since it is the XHTML namespace identifier which is used to identify the relevant elements.
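The two input requirements above can be illustrated with a minimal sketch built only on the standard SAX classes in the JDK. It is not the LinkFilter source: the class name `LinkCollectorSketch`, the helper `collect`, and the choice to examine only "a", "link" and "base" elements are assumptions made for illustration. It shows how a handler might use the XHTML namespace identifier to pick elements, and the locator's system ID to resolve relative links when no "base" element is present.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.Collections;
import java.util.Enumeration;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for illustration only; this is NOT the LinkFilter source.
class LinkCollectorSketch extends DefaultHandler {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    private final Set<String> links = new LinkedHashSet<>(); // a set ignores duplicates
    private Locator locator; // supplied by the parser before any other event
    private URI base;        // taken from a "base" element, if one was seen

    @Override
    public void setDocumentLocator(Locator l) { locator = l; }

    @Override
    public void startElement(String namespace, String local, String name, Attributes attrs) {
        // Only elements in the XHTML namespace are considered.
        if (!XHTML_NS.equals(namespace)) return;
        // Assumption: this sketch examines only "a", "link" and "base".
        if (!local.equals("a") && !local.equals("link") && !local.equals("base")) return;
        String href = attrs.getValue("", "href");
        if (href == null) return;
        // Drop the fragment; a link that was only a fragment is skipped.
        int hash = href.indexOf('#');
        if (hash >= 0) href = href.substring(0, hash);
        if (href.isEmpty()) return;
        try {
            if (local.equals("base")) { base = URI.create(href); return; }
            // Prefer the "base" element; otherwise fall back to the locator.
            URI context = base;
            if (context == null && locator != null && locator.getSystemId() != null)
                context = URI.create(locator.getSystemId());
            URI resolved = (context == null) ? URI.create(href) : context.resolve(href);
            links.add(resolved.toString());
        } catch (IllegalArgumentException e) {
            // "Dirty" input may carry malformed URIs; this sketch just skips them.
        }
    }

    @Override
    public void endDocument() { base = null; } // forget recorded base URI information

    Enumeration<String> getLinks() { return Collections.enumeration(links); }

    // Hypothetical helper: parse a string with a namespace-aware JDK parser.
    static Enumeration<String> collect(String xhtml, String systemId) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            LinkCollectorSketch handler = new LinkCollectorSketch();
            reader.setContentHandler(handler);
            InputSource source = new InputSource(new StringReader(xhtml));
            source.setSystemId(systemId); // becomes the locator's system ID
            reader.parse(source);
            return handler.getLinks();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

In this sketch, input without the XHTML namespace declaration yields no links at all, which mirrors the namespace requirement stated above.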
This works reasonably well with "dirty" XHTML events (not necessarily very close to valid) such as those that may be generated by the HtmlParser, which wraps the Swing HTML 3.2 parser, since it only uses the elements and their attributes rather than any structural information.
In the future this will be a natural place to use the new xml:base attribute ... in association with a stack of such. Similarly, this should later recognize (convert?) XLink data.
Fields inherited from class org.brownell.xml.pipeline.EventFilter |
HANDLER_URI |
Constructor Summary | |
LinkFilter()
Constructs a new event filter, which collects links in a private data structure for later enumeration. |
|
LinkFilter(EventConsumer next)
Constructs a new event filter, which collects links in a private data structure for later enumeration and passes all events, unmodified, to the next consumer. |
Method Summary | |
void |
endDocument()
Forgets about any base URI information that may be recorded. |
java.util.Enumeration |
getLinks()
Returns an enumeration of the links found since the filter was constructed, or since removeAllLinks() was called. |
void |
removeAllLinks()
Removes records about all links reported to the event stream, as if the filter were newly created. |
void |
setDocumentLocator(Locator l)
Used to resolve relative links, as the base URI for the document. |
void |
startElement(java.lang.String namespace,
java.lang.String local,
java.lang.String name,
Attributes attrs)
Collects URIs for (X)HTML content from elements which hold them. |
Methods inherited from class org.brownell.xml.pipeline.EventFilter |
attributeDecl, characters, comment, elementDecl, endCDATA, endDTD, endElement, endEntity, endPrefixMapping, externalEntityDecl, getContentHandler, getDTDHandler, getErrorHandler, getNext, getProperty, ignorableWhitespace, internalEntityDecl, notationDecl, processingInstruction, setContentHandler, setDTDHandler, setErrorHandler, setProperty, skippedEntity, startCDATA, startDocument, startDTD, startEntity, startPrefixMapping, unparsedEntityDecl |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
public LinkFilter()
public LinkFilter(EventConsumer next)
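The second constructor chains the filter to a downstream consumer that receives every event unmodified. The EventConsumer interface is not reproduced here, so the following sketch shows the same observe-then-forward pattern with the standard `org.xml.sax.helpers.XMLFilterImpl` instead. The class `PassThroughSketch` and its `demo` helper are hypothetical and are not part of this package.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical analogue built on standard SAX; NOT the EventFilter/EventConsumer API.
class PassThroughSketch extends XMLFilterImpl {
    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String namespace, String local, String name, Attributes attrs)
            throws SAXException {
        // Observe the event...
        String href = attrs.getValue("", "href");
        if (href != null) seen.add(href);
        // ...then forward it, unmodified, to the next handler in the chain.
        super.startElement(namespace, local, name, attrs);
    }

    // Returns { hrefs observed by the filter, startElement events seen downstream }.
    static int[] demo(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            PassThroughSketch filter = new PassThroughSketch();
            filter.setParent(factory.newSAXParser().getXMLReader());
            final int[] downstream = {0};
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String ns, String local, String name, Attributes a) {
                    downstream[0]++; // the downstream stage still sees every element
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
            return new int[] { filter.seen.size(), downstream[0] };
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The point of the pattern is that collecting links is a side effect: the downstream stage cannot tell whether the filter is present in the chain.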
Method Detail |
public java.util.Enumeration getLinks()
public void removeAllLinks()
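The contract of getLinks() and removeAllLinks() (duplicates ignored, enumeration of everything seen since construction or the last reset) can be sketched with a small stand-alone store. This is an assumption about one reasonable way to hold the data, not a description of the filter's actual private data structure; `LinkStoreSketch` and `record` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical store for illustration; the real private data structure is not documented.
class LinkStoreSketch {
    // An insertion-ordered set silently drops duplicate links.
    private final Set<String> links = new LinkedHashSet<>();

    void record(String uri) { links.add(uri); }

    // Enumerate a snapshot, so later additions do not disturb the caller's iteration.
    Enumeration<String> getLinks() {
        return Collections.enumeration(new ArrayList<>(links));
    }

    // Afterwards the store behaves as if newly created.
    void removeAllLinks() { links.clear(); }
}
```

A crawler would typically drain the enumeration after each document and then call removeAllLinks() before reusing the same instance.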
public void setDocumentLocator(Locator l)
Parameters:
locator - An object that can return the location of any SAX document event.
See Also:
Locator
public void startElement(java.lang.String namespace, java.lang.String local, java.lang.String name, Attributes attrs) throws SAXException
Parameters:
uri - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
localName - The local name (without prefix), or the empty string if Namespace processing is not being performed.
qName - The qualified name (with prefix), or the empty string if qualified names are not available.
atts - The attributes attached to the element. If there are no attributes, it shall be an empty Attributes object.
See Also:
ContentHandler.endElement(java.lang.String, java.lang.String, java.lang.String), Attributes
public void endDocument() throws SAXException
See Also:
ContentHandler.startDocument()