java.lang.Object
  |
  +--xml.HtmlParser
This is a wrapper around the javax.swing.text.html.parser.* HTML parser that implements the SAX2 interfaces. Given valid HTML, much invalid or malformed HTML, or compatible XHTML, it produces the stream of SAX parsing events that would result from parsing the equivalent (well-formed and often more valid) XHTML document. Element and attribute names are uniformly presented in lower case and belong to the XHTML namespace.
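As an illustration of consuming that event stream, the following minimal sketch records the namespace and local name of every start tag through the standard SAX2 ContentHandler callbacks. The class name ElementLister and its helper methods are hypothetical. The JDK's own XMLReader is used as a stand-in so the sketch runs without this package; with the package on the class path, a new xml.HtmlParser() would presumably be passed instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ElementLister {
    /** Records "{namespace}localName" for every start tag it sees. */
    static class Recorder extends DefaultHandler {
        final List<String> names = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            names.add("{" + uri + "}" + localName);
        }
    }

    /** Runs any SAX2 XMLReader over the text and returns the element names it reported. */
    static List<String> listElements(XMLReader reader, String text) {
        Recorder recorder = new Recorder();
        reader.setContentHandler(recorder);
        try {
            reader.parse(new InputSource(new StringReader(text)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return recorder.names;
    }

    /** Stand-in reader from the JDK (an assumption made so the sketch is runnable). */
    static XMLReader standInReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'><body><p>Hi</p></body></html>";
        for (String name : listElements(standInReader(), xhtml)) {
            System.out.println(name);
        }
    }
}
```

Because the handler only depends on the SAX2 interfaces, the same Recorder should work unchanged with any XMLReader implementation.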
Only one type of lexical event is reported: comments. With HTML, this is generally used to access inlined CSS that is wrapped in comments to protect it from browsers too old to understand the "style" tag. Expansions of built-in entities (such as "&nbsp;") and character references are accordingly not visible.
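A minimal sketch of collecting comment events follows. It assumes the handler is registered through the standard SAX2 property "http://xml.org/sax/properties/lexical-handler"; whether this parser recognizes that exact property ID is not stated here, so treat it as an assumption. The class name CommentCollector is hypothetical, and the JDK's XMLReader again serves as a runnable stand-in.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

public class CommentCollector {
    /** Standard SAX2 property ID for a LexicalHandler (assumed to apply to this parser too). */
    static final String LEXICAL_HANDLER = "http://xml.org/sax/properties/lexical-handler";

    /** Parses the text and returns the trimmed body of every comment reported. */
    static List<String> collectComments(XMLReader reader, String text) {
        final List<String> comments = new ArrayList<>();
        // DefaultHandler2 implements LexicalHandler with empty methods; only comment() is overridden.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                comments.add(new String(ch, start, length).trim());
            }
        };
        try {
            reader.setProperty(LEXICAL_HANDLER, handler);
            reader.parse(new InputSource(new StringReader(text)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return comments;
    }

    /** Stand-in reader from the JDK (an assumption made so the sketch is runnable). */
    static XMLReader standInReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><style><!-- p { color: red } --></style></head><body/></html>";
        System.out.println(collectComments(standInReader(), page));
    }
}
```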
This parser does not support dynamic modification of its input stream, which would be needed to fully support <script> tags that use the DOM to splice new page content into documents as they load.
Current (Swing 1.1.1) HTML parsing issues without included workarounds include:
This parser assigns all elements and attributes, except those known to be in the XML namespace, to the XHTML namespace. It also reports a default prefix mapping for that namespace. The intended overall result is XHTML that is as valid as the input HTML, given an appropriate doctype declaration. How fully that goal is achieved may be limited by problems in the Swing HTML parser, as noted above.
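The default prefix mapping mentioned above arrives through ContentHandler.startPrefixMapping with an empty prefix. The sketch below records such mappings; PrefixMappingProbe and its methods are hypothetical names, and the JDK's namespace-aware XMLReader parsing an XHTML snippet stands in for this parser.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class PrefixMappingProbe {
    /** Parses the text and returns every prefix-to-namespace mapping that was reported. */
    static Map<String, String> mappings(XMLReader reader, String text) {
        final Map<String, String> seen = new LinkedHashMap<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startPrefixMapping(String prefix, String uri) {
                // The default namespace is reported with the empty string as its prefix.
                seen.put(prefix, uri);
            }
        });
        try {
            reader.parse(new InputSource(new StringReader(text)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    /** Stand-in reader from the JDK (an assumption made so the sketch is runnable). */
    static XMLReader standInReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'><body/></html>";
        System.out.println(mappings(standInReader(), xhtml));
    }
}
```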
This driver adds ignorable newlines at various locations where they won't be confused with HTML content. They may of course be ignored. If they are kept, they make the output of this parser easier to print and view: without them, HTML files of any size would appear with no line breaks at all, which causes trouble for most text editors.
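A handler can choose either policy through the SAX ignorableWhitespace callback. The following sketch is an illustration only: the class WhitespacePolicy and its TextSink handler are hypothetical, and the events are replayed by hand rather than produced by a real parse, so the exact placement of newlines by this driver is not represented.

```java
import org.xml.sax.helpers.DefaultHandler;

public class WhitespacePolicy {
    /** Accumulates character data, optionally keeping ignorable whitespace. */
    static class TextSink extends DefaultHandler {
        private final boolean keepIgnorable;
        final StringBuilder out = new StringBuilder();

        TextSink(boolean keepIgnorable) {
            this.keepIgnorable = keepIgnorable;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // Keeping these newlines gives line-broken output; dropping them is always safe.
            if (keepIgnorable) {
                out.append(ch, start, length);
            }
        }
    }

    /** Feeds a fixed, hand-written event sequence to the sink (a stand-in for a real parse). */
    static String replay(boolean keepIgnorable) {
        TextSink sink = new TextSink(keepIgnorable);
        char[] text = "Hello".toCharArray();
        char[] newline = "\n".toCharArray();
        sink.ignorableWhitespace(newline, 0, newline.length);
        sink.characters(text, 0, text.length);
        sink.ignorableWhitespace(newline, 0, newline.length);
        return sink.out.toString();
    }

    public static void main(String[] args) {
        System.out.println("dropped: [" + replay(false) + "]");
        System.out.println("kept:    [" + replay(true) + "]");
    }
}
```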
There are also various undocumented, or poorly documented, behaviors of the Swing parser. It adds an illegal <__EndOfLineTag__> element after the root element, for example. These are ignored as well as possible, given the all but complete lack of specification for the Swing parser callbacks.
Constructor Summary | |
HtmlParser()
Constructs a new HTML parser. |
Method Summary | |
ContentHandler | getContentHandler() | SAX2: Returns the object used to report the logical content of an XML document. |
DTDHandler | getDTDHandler() | SAX2: Returns the object used to process declarations related to notations and unparsed entities. |
EntityResolver | getEntityResolver() | SAX2: Returns the object used when resolving external entities during parsing (both general and parameter entities). |
ErrorHandler | getErrorHandler() | SAX2: Returns the object used to receive callbacks for XML errors of all levels (fatal, nonfatal, warning). |
boolean | getFeature(java.lang.String featureId) | SAX2: Tells whether this parser supports the specified feature. |
java.lang.Object | getProperty(java.lang.String propertyId) | SAX2: Returns the specified property. |
void | parse(InputSource input) | SAX1: Parse the HTML text in the given input source. |
void | parse(java.lang.String uri) | SAX1: Parse the HTML text at the given input URI. |
void | setContentHandler(ContentHandler handler) | SAX2: Assigns the object used to report the logical content of an XML document. |
void | setDTDHandler(DTDHandler handler) | SAX1: Provides an object which may be used to intercept declarations related to notations and unparsed entities. |
void | setEntityResolver(EntityResolver resolver) | SAX1: Provides an object which may be used when resolving external entities during parsing (both general and parameter entities). |
void | setErrorHandler(ErrorHandler handler) | SAX1: Provides an object which receives callbacks for HTML errors of all levels (fatal, nonfatal, warning). |
void | setFeature(java.lang.String featureId, boolean state) | SAX2: Sets the state of features supported in this parser. |
void | setLocale(java.util.Locale locale) | SAX1: Identifies the locale which the parser should use for the diagnostics it provides. |
void | setProperty(java.lang.String propertyId, java.lang.Object property) | SAX2: Assigns the specified property. |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
public HtmlParser()
Method Detail |
public ErrorHandler getErrorHandler()
Specified by: getErrorHandler in interface XMLReader
Following copied from interface: org.xml.sax.XMLReader
See Also: XMLReader.setErrorHandler(org.xml.sax.ErrorHandler)
public void setErrorHandler(ErrorHandler handler)
Note that this parser does not provide a consistent categorization of errors according to the categories defined in the SAX API. Most problems are reported at the "warning" level, and even those few validity related errors reported at the "nonfatal" level may not be viewed as issues in all HTML environments. No errors are reported as "fatal".
Throwing an exception from an error handler may not work well.
Specified by: setErrorHandler in interface XMLReader
Following copied from interface: org.xml.sax.XMLReader
Parameters:
handler - The error handler.
Throws:
java.lang.NullPointerException - If the handler argument is null.
See Also: XMLReader.getErrorHandler()
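Since most problems arrive as warnings and throwing from the handler may not work well, one cautious approach is to record every report and inspect the list after parsing. The sketch below shows that pattern; CollectingErrorHandler is a hypothetical name, and the demonstration calls the callbacks directly instead of running a parse.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

public class CollectingErrorHandler implements ErrorHandler {
    /** One entry per report, in arrival order, formatted as "level: message". */
    final List<String> messages = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        record("warning", e);
    }

    @Override
    public void error(SAXParseException e) {
        record("error", e);
    }

    @Override
    public void fatalError(SAXParseException e) {
        // Recorded for completeness; this never throws, so parsing is not interrupted here.
        record("fatal", e);
    }

    private void record(String level, SAXParseException e) {
        messages.add(level + ": " + e.getMessage());
    }

    public static void main(String[] args) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        // Hand-made reports stand in for what a parser would deliver.
        handler.warning(new SAXParseException("unknown tag", null));
        handler.error(new SAXParseException("missing required attribute", null));
        for (String message : handler.messages) {
            System.out.println(message);
        }
    }
}
```

An instance would be registered with setErrorHandler before calling parse, and the messages list examined once parse returns.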
public DTDHandler getDTDHandler()
Specified by: getDTDHandler in interface XMLReader
Following copied from interface: org.xml.sax.XMLReader
See Also: XMLReader.setDTDHandler(org.xml.sax.DTDHandler)
public void setDTDHandler(DTDHandler handler)
Not used by this parser.
Specified by: setDTDHandler in interface XMLReader
Following copied from interface: org.xml.sax.XMLReader
Parameters:
handler - The DTD handler.
Throws:
java.lang.NullPointerException - If the handler argument is null.
See Also: XMLReader.getDTDHandler()
public EntityResolver getEntityResolver()
Specified by: getEntityResolver in interface XMLReader
Following copied from interface: org.xml.sax.XMLReader
See Also: XMLReader.setEntityResolver(org.xml.sax.EntityResolver)
public void setEntityResolver(EntityResolver resolver)
Not used by this parser.
Specified by: setEntityResolver in interface XMLReader
Following copied from interface: org.xml.sax.XMLReader
Parameters:
resolver - The entity resolver.
Throws:
java.lang.NullPointerException - If the resolver argument is null.
See Also: XMLReader.getEntityResolver()
public ContentHandler getContentHandler()
Specified by: getContentHandler in interface XMLReader
Following copied from interface: org.xml.sax.XMLReader
See Also: XMLReader.setContentHandler(org.xml.sax.ContentHandler)
public void setContentHandler(ContentHandler handler)
Specified by: setContentHandler in interface XMLReader
Following copied from interface: org.xml.sax.XMLReader
Parameters:
handler - The content handler.
Throws:
java.lang.NullPointerException - If the handler argument is null.
See Also: XMLReader.getContentHandler()
public void setLocale(java.util.Locale locale) throws SAXException
Not used by this parser.
Throws:
SAXException - as defined in the specification for org.xml.sax.Parser.setLocale()
public void parse(InputSource input) throws SAXException, java.io.IOException
Specified by: parse in interface XMLReader
Throws:
SAXException - as defined in the specification for org.xml.sax.Parser.parse()
java.io.IOException - as defined in the specification for org.xml.sax.Parser.parse()
public void parse(java.lang.String uri) throws SAXException, java.io.IOException
Specified by: parse in interface XMLReader
Following copied from interface: org.xml.sax.XMLReader
Parameters:
systemId - The system identifier (URI).
Throws:
SAXException - Any SAX exception, possibly wrapping another exception.
java.io.IOException - An IO exception from the parser, possibly from a byte stream or character stream supplied by the application.
See Also: XMLReader.parse(org.xml.sax.InputSource)
public boolean getFeature(java.lang.String featureId) throws SAXNotRecognizedException, SAXNotSupportedException
Specified by: getFeature in interface XMLReader
Following copied from interface: org.xml.sax.XMLReader
Parameters:
name - The feature name, which is a fully-qualified URI.
Throws:
SAXNotRecognizedException - When the XMLReader does not recognize the feature name.
SAXNotSupportedException - When the XMLReader recognizes the feature name but cannot determine its value at this time.
See Also: XMLReader.setFeature(java.lang.String, boolean)
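Because getFeature signals unknown or unavailable features through exceptions, callers typically wrap the query. The sketch below maps the three outcomes to a short description. FeatureProbe is a hypothetical name, the feature IDs shown are the standard SAX2 "namespaces" ID and a made-up one, and the JDK's XMLReader is a stand-in; which features this parser actually recognizes is not listed here.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

public class FeatureProbe {
    /** Returns "on", "off", "not recognized", or "not supported" for the given feature ID. */
    static String describe(XMLReader reader, String featureId) {
        try {
            return reader.getFeature(featureId) ? "on" : "off";
        } catch (SAXNotRecognizedException e) {
            return "not recognized";
        } catch (SAXNotSupportedException e) {
            return "not supported";
        }
    }

    /** Stand-in reader from the JDK (an assumption made so the sketch is runnable). */
    static XMLReader standInReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        XMLReader reader = standInReader();
        String[] ids = {
            "http://xml.org/sax/features/namespaces",  // standard SAX2 feature ID
            "http://example.org/no-such-feature"       // made-up ID, for illustration only
        };
        for (String id : ids) {
            System.out.println(id + " -> " + describe(reader, id));
        }
    }
}
```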
public java.lang.Object getProperty(java.lang.String propertyId) throws SAXNotRecognizedException, SAXNotSupportedException
Specified by: getProperty in interface XMLReader
Following copied from interface: org.xml.sax.XMLReader
Parameters:
name - The property name, which is a fully-qualified URI.
Throws:
SAXNotRecognizedException - When the XMLReader does not recognize the property name.
SAXNotSupportedException - When the XMLReader recognizes the property name but cannot determine its value at this time.
See Also: XMLReader.setProperty(java.lang.String, java.lang.Object)
public void setFeature(java.lang.String featureId, boolean state) throws SAXNotRecognizedException, SAXNotSupportedException
Specified by: setFeature in interface XMLReader
Following copied from interface: org.xml.sax.XMLReader
Parameters:
name - The feature name, which is a fully-qualified URI.
state - The requested state of the feature (true or false).
Throws:
SAXNotRecognizedException - When the XMLReader does not recognize the feature name.
SAXNotSupportedException - When the XMLReader recognizes the feature name but cannot set the requested value.
See Also: XMLReader.getFeature(java.lang.String)
public void setProperty(java.lang.String propertyId, java.lang.Object property) throws SAXNotRecognizedException, SAXNotSupportedException
Specified by: setProperty in interface XMLReader
Following copied from interface: org.xml.sax.XMLReader
Parameters:
name - The property name, which is a fully-qualified URI.
state - The requested value for the property.
Throws:
SAXNotRecognizedException - When the XMLReader does not recognize the property name.
SAXNotSupportedException - When the XMLReader recognizes the property name but cannot set the requested value.
Source code is GPL'd at http://xmlconf.sourceforge.net.