java.lang.Object
  +--org.brownell.xml.HtmlParser
This is a wrapper around the javax.swing.text.html.parser.* HTML parser that implements the SAX2 interfaces. Given valid HTML, much invalid or malformed HTML, or compatible XHTML, it produces the stream of SAX parsing events that would result from parsing the corresponding (well-formed and often more valid) XHTML document. Element and attribute names are uniformly presented in lower case and belong to the XHTML namespace.
Only one type of lexical event is reported: comments. This is generally used with HTML to access inlined CSS that is wrapped in comments to protect it from browsers too old to understand what the "style" tag means. Expansions of built-in entities (such as "&nbsp;") and of character references are accordingly not visible.
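Comment events of this kind are normally consumed through the standard SAX2 LexicalHandler interface. The following is a minimal sketch, not code from this package: CommentCollector and standInReader are hypothetical names, the property ID is the standard SAX2 one and is only assumed to be what HtmlParser accepts, and a JDK XML reader stands in for HtmlParser so that the example runs without this library.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** Hypothetical helper: gathers the text of every comment event. */
class CommentCollector extends DefaultHandler2 {
    final List<String> comments = new ArrayList<>();

    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }

    /** Registers a collector as the lexical handler of any SAX2 reader, then parses. */
    static List<String> collect(XMLReader reader, String markup)
            throws SAXException, IOException {
        CommentCollector collector = new CommentCollector();
        // Standard SAX2 property ID; assumed (not confirmed) to be honored by HtmlParser.
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", collector);
        reader.parse(new InputSource(new StringReader(markup)));
        return collector.comments;
    }

    /** JDK reader used as a stand-in for HtmlParser (assumption for runnability). */
    static XMLReader standInReader() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }
}
```

With this package on the classpath, the stand-in reader would presumably be replaced by `new HtmlParser()`; the collector itself depends only on the SAX2 interfaces.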
This parser does not support dynamic modification of its input stream, which would be needed to fully support <script> tags that use the DOM to splice new page content into documents as they load.
Current (Swing 1.1.1) HTML parsing issues without included workarounds include:
This assigns all elements and attributes, except those known to be in the XML namespace, to the XHTML namespace. It also reports a default prefix mapping for that namespace. The overall result is intended to be that this produces XHTML which is as valid as the input HTML, given an appropriate doctype declaration. Achievement of that goal may be limited by problems in the Swing HTML parser, as noted above.
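One way to observe the namespace behavior described above is to record prefix mappings and element names from a ContentHandler. This is a rough sketch under stated assumptions: NamespaceTrace and standInReader are hypothetical names, a JDK XML reader stands in for HtmlParser, and the sample input declares the XHTML namespace itself because the stand-in does not add it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: records prefix mappings and namespace-qualified element names. */
class NamespaceTrace extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        // An empty prefix denotes the default namespace mapping.
        events.add("prefix '" + prefix + "' -> " + uri);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("element {" + uri + "}" + localName);
    }

    /** Parses the markup with the given reader and returns the recorded events. */
    static List<String> trace(XMLReader reader, String markup)
            throws SAXException, IOException {
        NamespaceTrace handler = new NamespaceTrace();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.events;
    }

    /** JDK reader used as a stand-in for HtmlParser (assumption for runnability). */
    static XMLReader standInReader() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }
}
```

Passing an HtmlParser instance to `trace` instead of the stand-in should show the default prefix mapping and the lower-case element names without the input having to declare the namespace.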
This driver adds ignorable newlines at various locations where they won't be confused with HTML content. They may of course be ignored. When they are kept, they make the output of this parser easier to print: otherwise HTML files of any size would appear without line breaks of any kind, and viewing the output would cause trouble for most text editors.
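A handler can choose between the two treatments of these newlines through the standard ignorableWhitespace callback. The sketch below is illustrative only: TextCollector and its keepLayout flag are hypothetical, and it assumes the added newlines arrive through ignorableWhitespace, as the description above suggests.

```java
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: collects character data, optionally keeping ignorable newlines. */
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final boolean keepLayout;

    TextCollector(boolean keepLayout) {
        this.keepLayout = keepLayout;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Real document content is always kept.
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Layout-only whitespace is kept only when printable output is wanted.
        if (keepLayout) {
            text.append(ch, start, length);
        }
    }

    String text() {
        return text.toString();
    }
}
```

A printer would construct the handler with `keepLayout` set to true, while a consumer interested only in content would pass false.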
There are also various undocumented, or poorly documented, behaviors of the Swing parser. It adds an illegal <__EndOfLineTag__> element after the root element, for example. These are ignored as well as possible, given the all but complete lack of specification for the Swing parser callbacks.
Constructor Summary
HtmlParser()
    Constructs a new HTML parser.
Method Summary
ContentHandler getContentHandler()
    SAX2: Returns the object used to report the logical content of an XML document.
DTDHandler getDTDHandler()
    SAX2: Returns the object used to process declarations related to notations and unparsed entities.
EntityResolver getEntityResolver()
    SAX2: Returns the object used when resolving external entities during parsing (both general and parameter entities).
ErrorHandler getErrorHandler()
    SAX2: Returns the object used to receive callbacks for XML errors of all levels (fatal, nonfatal, warning).
boolean getFeature(java.lang.String featureId)
    SAX2: Tells whether this parser supports the specified feature.
java.lang.Object getProperty(java.lang.String propertyId)
    SAX2: Returns the specified property.
void parse(InputSource input)
    SAX1: Parse the HTML text in the given input source.
void parse(java.lang.String uri)
    SAX1: Parse the HTML text at the given input URI.
void setContentHandler(ContentHandler handler)
    SAX2: Assigns the object used to report the logical content of an XML document.
void setDTDHandler(DTDHandler handler)
    SAX1: Provides an object which may be used to intercept declarations related to notations and unparsed entities.
void setEntityResolver(EntityResolver resolver)
    SAX1: Provides an object which may be used when resolving external entities during parsing (both general and parameter entities).
void setErrorHandler(ErrorHandler handler)
    SAX1: Provides an object which receives callbacks for HTML errors of all levels (fatal, nonfatal, warning).
void setFeature(java.lang.String featureId, boolean state)
    SAX2: Sets the state of features supported in this parser.
void setLocale(java.util.Locale locale)
    SAX1: Identifies the locale which the parser should use for the diagnostics it provides.
void setProperty(java.lang.String propertyId, java.lang.Object property)
    SAX2: Assigns the specified property.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail
public HtmlParser()
Method Detail
public ErrorHandler getErrorHandler()
See Also:
    XMLReader.setErrorHandler(org.xml.sax.ErrorHandler)
public void setErrorHandler(ErrorHandler handler)
Note that this parser does not provide a consistent categorization of errors according to the categories defined in the SAX API. Most problems are reported at the "warning" level, and even those few validity-related errors reported at the "nonfatal" level may not be viewed as issues in all HTML environments. No errors are reported as "fatal".
Throwing an exception from an error handler may not work well.
Parameters:
    handler - The error handler.
See Also:
    XMLReader.getErrorHandler()
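Given the note that throwing from an error handler may not work well, one cautious pattern is a handler that only records what it is told and leaves any decision until after parsing. This is a hedged sketch rather than behavior prescribed by this class: CollectingErrorHandler is a hypothetical name, and it relies only on the standard org.xml.sax.ErrorHandler interface.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

/** Hypothetical helper: records diagnostics instead of throwing from the callbacks. */
class CollectingErrorHandler implements ErrorHandler {
    final List<String> warnings = new ArrayList<>();
    final List<String> errors = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        warnings.add(describe(e));
    }

    @Override
    public void error(SAXParseException e) {
        errors.add(describe(e));
    }

    @Override
    public void fatalError(SAXParseException e) {
        // Not expected from this parser according to the note above, but recorded if it occurs.
        errors.add(describe(e));
    }

    private static String describe(SAXParseException e) {
        return "line " + e.getLineNumber() + ": " + e.getMessage();
    }
}
```

After `parse` returns, the caller can inspect `warnings` and `errors` and decide whether the document is acceptable for its own HTML environment.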
public DTDHandler getDTDHandler()
See Also:
    XMLReader.setDTDHandler(org.xml.sax.DTDHandler)
public void setDTDHandler(DTDHandler handler)
Not used by this parser.
Parameters:
    handler - The DTD handler.
See Also:
    XMLReader.getDTDHandler()
public EntityResolver getEntityResolver()
See Also:
    XMLReader.setEntityResolver(org.xml.sax.EntityResolver)
public void setEntityResolver(EntityResolver resolver)
Not used by this parser.
Parameters:
    resolver - The entity resolver.
See Also:
    XMLReader.getEntityResolver()
public ContentHandler getContentHandler()
See Also:
    XMLReader.setContentHandler(org.xml.sax.ContentHandler)
public void setContentHandler(ContentHandler handler)
Parameters:
    handler - The content handler.
See Also:
    XMLReader.getContentHandler()
public void setLocale(java.util.Locale locale) throws SAXException
Not used by this parser.
public void parse(InputSource input) throws SAXException, java.io.IOException
public void parse(java.lang.String uri) throws SAXException, java.io.IOException
Parameters:
    uri - The system identifier (URI).
See Also:
    XMLReader.parse(org.xml.sax.InputSource)
public boolean getFeature(java.lang.String featureId) throws SAXNotRecognizedException, SAXNotSupportedException
Parameters:
    featureId - The feature name, which is a fully-qualified URI.
See Also:
    XMLReader.setFeature(java.lang.String, boolean)
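Because getFeature throws for feature IDs it does not recognize, callers often wrap the query. The following sketch shows one way to do that against any SAX2 reader. FeatureProbe and standInReader are hypothetical names, the JDK reader is a stand-in for HtmlParser, and which feature IDs HtmlParser itself recognizes is not stated on this page.

```java
import java.util.Optional;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

/** Hypothetical helper: queries a SAX2 feature without propagating exceptions. */
class FeatureProbe {
    /** Returns the feature state, or empty if the reader cannot report it. */
    static Optional<Boolean> query(XMLReader reader, String featureId) {
        try {
            return Optional.of(reader.getFeature(featureId));
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return Optional.empty();
        }
    }

    /** JDK reader used as a stand-in for HtmlParser (assumption for runnability). */
    static XMLReader standInReader() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }
}
```

A typical call would be `FeatureProbe.query(reader, "http://xml.org/sax/features/namespaces")`, using the standard SAX2 namespaces feature ID.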
public java.lang.Object getProperty(java.lang.String propertyId) throws SAXNotRecognizedException, SAXNotSupportedException
Parameters:
    propertyId - The property name, which is a fully-qualified URI.
See Also:
    XMLReader.setProperty(java.lang.String, java.lang.Object)
public void setFeature(java.lang.String featureId, boolean state) throws SAXNotRecognizedException, SAXNotSupportedException
Parameters:
    featureId - The feature name, which is a fully-qualified URI.
    state - The requested state of the feature (true or false).
See Also:
    XMLReader.getFeature(java.lang.String)
public void setProperty(java.lang.String propertyId, java.lang.Object property) throws SAXNotRecognizedException, SAXNotSupportedException
Parameters:
    propertyId - The property name, which is a fully-qualified URI.
    property - The requested value for the property.