org.brownell.xml
Class HtmlParser

java.lang.Object
  |
  +--org.brownell.xml.HtmlParser

public final class HtmlParser
extends java.lang.Object
implements XMLReader

This is a wrapper around the javax.swing.text.html.parser.* HTML parser that implements the SAX2 interfaces. On valid HTML, much invalid or malformed HTML, and compatible XHTML, it produces the stream of SAX parsing events that would result from parsing the equivalent (well formed and often more valid) XHTML document. Element and attribute names are uniformly presented in lower case and belong to the XHTML namespace.
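A minimal sketch of how a client might observe those events, using only the standard org.xml.sax interfaces. ElementLister and its list method are hypothetical names for illustration and are not part of this package; any XMLReader can be passed in.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records each start tag as "{namespace-uri}local-name".
class ElementLister extends DefaultHandler {
    final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        names.add("{" + uri + "}" + localName);
    }

    // Registers a fresh lister as the content handler, parses the text,
    // and returns the element names in document order.
    static List<String> list(XMLReader reader, String text) throws IOException, SAXException {
        ElementLister handler = new ElementLister();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(text)));
        return handler.names;
    }
}
```

With this class available, the call would presumably be ElementLister.list(new HtmlParser(), html), and every recorded name would be expected to be lower case and carry the XHTML namespace URI.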

Only one type of lexical event is reported: comments. This is generally used with HTML to access inlined CSS that is wrapped in comments to protect it from browsers too old to understand the "style" tag. Expansions of built-in entities (such as "&nbsp;") or of character references are accordingly not visible.
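A sketch of collecting those comment events. It assumes the parser recognizes the standard SAX2 lexical handler property id, which this page does not state explicitly; CommentCollector is a hypothetical helper name.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper: gathers the text of every reported comment.
class CommentCollector extends DefaultHandler2 {
    // Standard SAX2 property id (assumed to be the one this parser accepts).
    static final String LEXICAL_HANDLER = "http://xml.org/sax/properties/lexical-handler";

    final List<String> comments = new ArrayList<>();

    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }

    // Installs a collector as both content and lexical handler, then parses.
    static List<String> collect(XMLReader reader, String text) throws IOException, SAXException {
        CommentCollector collector = new CommentCollector();
        reader.setContentHandler(collector);
        reader.setProperty(LEXICAL_HANDLER, collector);
        reader.parse(new InputSource(new StringReader(text)));
        return collector.comments;
    }
}
```

DefaultHandler2 supplies empty implementations of the other LexicalHandler callbacks, so only comment needs to be overridden.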

This parser does not support dynamic modification of its input stream, which would be needed to fully support <script> tags that use the DOM to splice new page content into documents as they load.

Current (Swing 1.1.1) HTML parsing issues without included workarounds include:

This parser assigns all elements and attributes, except those known to be in the XML namespace, to the XHTML namespace. It also reports a default prefix mapping for that namespace. The intended overall result is XHTML that is as valid as the input HTML, given an appropriate doctype declaration. Achievement of that goal may be limited by problems in the Swing HTML parser, as noted above.

This driver adds ignorable newlines at various locations where they won't be confused with HTML content. These may of course be ignored. If they are kept, they make the output of this parser easier to print and view: otherwise HTML files of any size appear without line breaks of any kind, which causes trouble for most text editors.
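A sketch of the two choices a handler has, assuming the added newlines arrive through the standard ContentHandler.ignorableWhitespace callback. TextCollector and its keepLineBreaks flag are hypothetical names for illustration.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: accumulates character data and either keeps or
// drops whatever the parser reports as ignorable whitespace.
class TextCollector extends DefaultHandler {
    private final boolean keepLineBreaks;
    final StringBuilder text = new StringBuilder();

    TextCollector(boolean keepLineBreaks) {
        this.keepLineBreaks = keepLineBreaks;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Kept only when the caller wants readable, line-broken output.
        if (keepLineBreaks) {
            text.append(ch, start, length);
        }
    }
}
```

A serializer would typically pass true so its output has line breaks, while code that only cares about document content would pass false.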

There are also various undocumented, or poorly documented, behaviors of the Swing parser. It adds an illegal <__EndOfLineTag__> element after the root element, for example. These are ignored as well as possible, given the all but complete lack of specification for the Swing parser callbacks.

Version:
$Date: 2000/05/29 12:12:04 $
Author:
David Brownell

Constructor Summary
HtmlParser()
          Constructs a new HTML parser.
 
Method Summary
 ContentHandler getContentHandler()
          SAX2: Returns the object used to report the logical content of an XML document.
 DTDHandler getDTDHandler()
          SAX2: Returns the object used to process declarations related to notations and unparsed entities.
 EntityResolver getEntityResolver()
          SAX2: Returns the object used when resolving external entities during parsing (both general and parameter entities).
 ErrorHandler getErrorHandler()
          SAX2: Returns the object used to receive callbacks for XML errors of all levels (fatal, nonfatal, warning).
 boolean getFeature(java.lang.String featureId)
          SAX2: Tells whether this parser supports the specified feature.
 java.lang.Object getProperty(java.lang.String propertyId)
          SAX2: Returns the specified property.
 void parse(InputSource input)
          SAX1: Parse the HTML text in the given input source.
 void parse(java.lang.String uri)
          SAX1: Parse the HTML text at the given input URI.
 void setContentHandler(ContentHandler handler)
          SAX2: Assigns the object used to report the logical content of an XML document.
 void setDTDHandler(DTDHandler handler)
          SAX1: Provides an object which may be used to intercept declarations related to notations and unparsed entities.
 void setEntityResolver(EntityResolver resolver)
          SAX1: Provides an object which may be used when resolving external entities during parsing (both general and parameter entities).
 void setErrorHandler(ErrorHandler handler)
          SAX1: Provides an object which receives callbacks for HTML errors of all levels (fatal, nonfatal, warning).
 void setFeature(java.lang.String featureId, boolean state)
          SAX2: Sets the state of features supported in this parser.
 void setLocale(java.util.Locale locale)
          SAX1: Identifies the locale which the parser should use for the diagnostics it provides.
 void setProperty(java.lang.String propertyId, java.lang.Object property)
          SAX2: Assigns the specified property.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HtmlParser

public HtmlParser()
Constructs a new HTML parser.
Method Detail

getErrorHandler

public ErrorHandler getErrorHandler()
SAX2: Returns the object used to receive callbacks for XML errors of all levels (fatal, nonfatal, warning).
Specified by:
getErrorHandler in interface XMLReader
Tags copied from interface: XMLReader
Returns:
The current error handler, or null if none has been registered.
See Also:
XMLReader.setErrorHandler(org.xml.sax.ErrorHandler)

setErrorHandler

public void setErrorHandler(ErrorHandler handler)
SAX1: Provides an object which receives callbacks for HTML errors of all levels (fatal, nonfatal, warning).

Note that this parser does not provide a consistent categorization of errors according to the categories defined in the SAX API. Most problems are reported at the "warning" level, and even those few validity related errors reported at the "nonfatal" level may not be viewed as issues in all HTML environments. No errors are reported as "fatal".

Throwing an exception from an error handler may not work well.
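Given the notes above, one cautious approach is an error handler that records problems instead of throwing. The following is a sketch only; RecordingErrorHandler is a hypothetical name and not part of this package.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Hypothetical helper: stores each report with its SAX level and never throws,
// so the caller can inspect the list after parse() returns.
class RecordingErrorHandler implements ErrorHandler {
    final List<String> problems = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        record("warning", e);
    }

    @Override
    public void error(SAXParseException e) {
        record("error", e);
    }

    @Override
    public void fatalError(SAXParseException e) {
        // Not expected from this parser according to the note above,
        // but the interface requires an implementation.
        record("fatal", e);
    }

    private void record(String level, SAXParseException e) {
        problems.add(level + ": " + e.getMessage());
    }
}
```

Because the levels are not assigned consistently, a caller would probably treat the recorded list as advisory rather than rejecting a document on the first "error" entry.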

Specified by:
setErrorHandler in interface XMLReader
Tags copied from interface: XMLReader
Parameters:
handler - The error handler.
Throws:
java.lang.NullPointerException - If the handler argument is null.
See Also:
XMLReader.getErrorHandler()

getDTDHandler

public DTDHandler getDTDHandler()
SAX2: Returns the object used to process declarations related to notations and unparsed entities.
Specified by:
getDTDHandler in interface XMLReader
Tags copied from interface: XMLReader
Returns:
The current DTD handler, or null if none has been registered.
See Also:
XMLReader.setDTDHandler(org.xml.sax.DTDHandler)

setDTDHandler

public void setDTDHandler(DTDHandler handler)
SAX1: Provides an object which may be used to intercept declarations related to notations and unparsed entities.

Not used by this parser.

Specified by:
setDTDHandler in interface XMLReader
Tags copied from interface: XMLReader
Parameters:
handler - The DTD handler.
Throws:
java.lang.NullPointerException - If the handler argument is null.
See Also:
XMLReader.getDTDHandler()

getEntityResolver

public EntityResolver getEntityResolver()
SAX2: Returns the object used when resolving external entities during parsing (both general and parameter entities).
Specified by:
getEntityResolver in interface XMLReader
Tags copied from interface: XMLReader
Returns:
The current entity resolver, or null if none has been registered.
See Also:
XMLReader.setEntityResolver(org.xml.sax.EntityResolver)

setEntityResolver

public void setEntityResolver(EntityResolver resolver)
SAX1: Provides an object which may be used when resolving external entities during parsing (both general and parameter entities).

Not used by this parser.

Specified by:
setEntityResolver in interface XMLReader
Tags copied from interface: XMLReader
Parameters:
resolver - The entity resolver.
Throws:
java.lang.NullPointerException - If the resolver argument is null.
See Also:
XMLReader.getEntityResolver()

getContentHandler

public ContentHandler getContentHandler()
SAX2: Returns the object used to report the logical content of an XML document.
Specified by:
getContentHandler in interface XMLReader
Tags copied from interface: XMLReader
Returns:
The current content handler, or null if none has been registered.
See Also:
XMLReader.setContentHandler(org.xml.sax.ContentHandler)

setContentHandler

public void setContentHandler(ContentHandler handler)
SAX2: Assigns the object used to report the logical content of an XML document.
Specified by:
setContentHandler in interface XMLReader
Tags copied from interface: XMLReader
Parameters:
handler - The content handler.
Throws:
java.lang.NullPointerException - If the handler argument is null.
See Also:
XMLReader.getContentHandler()

setLocale

public void setLocale(java.util.Locale locale)
               throws SAXException
SAX1: Identifies the locale which the parser should use for the diagnostics it provides.

Not used by this parser.

Throws:
SAXException - as defined in the specification for org.xml.sax.Parser.setLocale()

parse

public void parse(InputSource input)
           throws SAXException,
                  java.io.IOException
SAX1: Parse the HTML text in the given input source.
Specified by:
parse in interface XMLReader
Throws:
SAXException - as defined in the specification for org.xml.sax.Parser.parse()
java.io.IOException - as defined in the specification for org.xml.sax.Parser.parse()

parse

public void parse(java.lang.String uri)
           throws SAXException,
                  java.io.IOException
SAX1: Parse the HTML text at the given input URI.
Specified by:
parse in interface XMLReader
Tags copied from interface: XMLReader
Parameters:
uri - The system identifier (URI).
Throws:
SAXException - Any SAX exception, possibly wrapping another exception.
java.io.IOException - An IO exception from the parser, possibly from a byte stream or character stream supplied by the application.
See Also:
XMLReader.parse(org.xml.sax.InputSource)

getFeature

public boolean getFeature(java.lang.String featureId)
                   throws SAXNotRecognizedException,
                          SAXNotSupportedException
SAX2: Tells whether this parser supports the specified feature.
Specified by:
getFeature in interface XMLReader
Tags copied from interface: XMLReader
Parameters:
featureId - The feature name, which is a fully-qualified URI.
Returns:
The current state of the feature (true or false).
Throws:
SAXNotRecognizedException - When the XMLReader does not recognize the feature name.
SAXNotSupportedException - When the XMLReader recognizes the feature name but cannot determine its value at this time.
See Also:
XMLReader.setFeature(java.lang.String, boolean)
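A sketch of probing a feature without letting either exception escape. FeatureProbe is a hypothetical helper; the feature id used in the note below is the standard SAX2 namespaces feature, and this page does not say which ids the parser actually recognizes.

```java
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper: turns the three possible outcomes of getFeature
// into a printable string.
class FeatureProbe {
    static String describe(XMLReader reader, String featureId) {
        try {
            return String.valueOf(reader.getFeature(featureId));
        } catch (SAXNotRecognizedException e) {
            return "unrecognized";
        } catch (SAXNotSupportedException e) {
            return "unsupported";
        }
    }
}
```

For example, FeatureProbe.describe(reader, "http://xml.org/sax/features/namespaces") yields "true" or "false" when the reader knows the feature, and one of the two fallback strings otherwise.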

getProperty

public java.lang.Object getProperty(java.lang.String propertyId)
                             throws SAXNotRecognizedException,
                                    SAXNotSupportedException
SAX2: Returns the specified property. At this time only lexical handlers are supported.
Specified by:
getProperty in interface XMLReader
Tags copied from interface: XMLReader
Parameters:
propertyId - The property name, which is a fully-qualified URI.
Returns:
The current value of the property.
Throws:
SAXNotRecognizedException - When the XMLReader does not recognize the property name.
SAXNotSupportedException - When the XMLReader recognizes the property name but cannot determine its value at this time.
See Also:
XMLReader.setProperty(java.lang.String, java.lang.Object)

setFeature

public void setFeature(java.lang.String featureId,
                       boolean state)
                throws SAXNotRecognizedException,
                       SAXNotSupportedException
SAX2: Sets the state of features supported in this parser. As of this writing, no feature's state may be changed from its default value.
Specified by:
setFeature in interface XMLReader
Tags copied from interface: XMLReader
Parameters:
featureId - The feature name, which is a fully-qualified URI.
state - The requested state of the feature (true or false).
Throws:
SAXNotRecognizedException - When the XMLReader does not recognize the feature name.
SAXNotSupportedException - When the XMLReader recognizes the feature name but cannot set the requested value.
See Also:
XMLReader.getFeature(java.lang.String)
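Since no feature can currently be changed from its default, callers that need a particular setting may prefer to check the current state before attempting to set it. A sketch under that assumption follows; FeatureGuard and ensure are hypothetical names.

```java
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper: reports whether the reader ends up in the wanted state.
class FeatureGuard {
    static boolean ensure(XMLReader reader, String featureId, boolean wanted) {
        try {
            if (reader.getFeature(featureId) == wanted) {
                return true; // already in the wanted state, nothing to change
            }
            reader.setFeature(featureId, wanted);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            // Unknown feature, or a known one whose state cannot be changed.
            return false;
        }
    }
}
```

A false result lets the caller fall back or fail with its own message instead of propagating a SAX exception.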

setProperty

public void setProperty(java.lang.String propertyId,
                        java.lang.Object property)
                 throws SAXNotRecognizedException,
                        SAXNotSupportedException
SAX2: Assigns the specified property. At this time only lexical handlers are supported, and these must not be changed to values of the wrong type. Like SAX1 handlers, these may be changed at any time.
Specified by:
setProperty in interface XMLReader
Tags copied from interface: XMLReader
Parameters:
propertyId - The property name, which is a fully-qualified URI.
property - The requested value for the property.
Throws:
SAXNotRecognizedException - When the XMLReader does not recognize the property name.
SAXNotSupportedException - When the XMLReader recognizes the property name but cannot set the requested value.