How do I construct a parser in XML4J version 2?

 

In XML4J version 2, the DOM API is implemented on top of the SAX API. XML4J version 2 has a modular architecture and comes pre-bundled with four configurations of the parser (all in the com.ibm.xml.parsers package). These are:

  • SAX parser, Non Validating (com.ibm.xml.parsers.SAXParser)
  • SAX parser, Validating (com.ibm.xml.parsers.ValidatingSAXParser)
  • DOM parser, Non Validating (com.ibm.xml.parsers.NonValidatingDOMParser)
  • DOM parser, Validating (com.ibm.xml.parsers.DOMParser)

There are two ways the parser classes can be instantiated:

The first way is to create a string containing the fully qualified name of the parser class and pass it to the org.xml.sax.helpers.ParserFactory.makeParser() method, which instantiates the class. This approach is useful if your application needs to switch between different parser configurations. The code snippet below uses it to instantiate a (validating) DOMParser.

import java.io.IOException;
import org.xml.sax.Parser;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.ParserFactory;
import com.ibm.xml.parsers.DOMParser;
import org.w3c.dom.Document;
...
String parserClass = "com.ibm.xml.parsers.DOMParser";
String xmlFile = "file:///xml4j2/data/personal.xml";
// makeParser() can also throw ClassNotFoundException, IllegalAccessException
// and InstantiationException if the named class cannot be loaded or created.
Parser parser = ParserFactory.makeParser(parserClass);
try {
    parser.parse(xmlFile);
} catch (SAXException se) {
    se.printStackTrace();
} catch (IOException ioe) {
    ioe.printStackTrace();
}
// The next line is only for DOM Parsers
Document doc = ((DOMParser) parser).getDocument();
...

The second way is to instantiate the parser class explicitly, as in the example below, which creates a validating DOM parser. Use this approach when you know exactly which parser configuration you need and are sure that you will not need to switch configurations.

import java.io.IOException;
import org.xml.sax.SAXException;
import com.ibm.xml.parsers.DOMParser;
import org.w3c.dom.Document;
...
String xmlFile = "file:///xml4j2/data/personal.xml";
DOMParser parser = new DOMParser();
try {
    parser.parse(xmlFile);
} catch (SAXException se) {
    se.printStackTrace();
} catch (IOException ioe) {
    ioe.printStackTrace();
}

// The next line is only for DOM Parsers
Document doc = parser.getDocument();
...

Once you have the Document object, you can call any method on it as defined by the DOM specification.
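As an illustration, the sketch below walks a DOM tree using only methods defined by the DOM specification (getDocumentElement, getFirstChild, getNextSibling, getNodeType). The class name DomWalker and its two helper methods are hypothetical names chosen for this example; they are not part of XML4J.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper class for illustration; not part of XML4J.
public class DomWalker {

    // Counts the element nodes in the subtree rooted at the given node,
    // including the node itself if it is an element.
    public static int countElements(Node node) {
        int count = (node.getNodeType() == Node.ELEMENT_NODE) ? 1 : 0;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countElements(child);
        }
        return count;
    }

    // Returns the tag name of the document's root element.
    public static String rootName(Document doc) {
        return doc.getDocumentElement().getTagName();
    }
}
```

With a Document obtained from parser.getDocument(), you could then call DomWalker.countElements(doc) or DomWalker.rootName(doc).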

 
   

How do I create a DOM parser?

Use one of the methods in the question above, and use com.ibm.xml.parsers.DOMParser to get a validating parser and com.ibm.xml.parsers.NonValidatingDOMParser to get a non-validating parser.

To access the DOM tree, you can call the getDocument() method on the parser.

How do I create a SAX parser?

Use one of the methods in the question above, and use com.ibm.xml.parsers.ValidatingSAXParser to get a validating parser and com.ibm.xml.parsers.SAXParser to get a non-validating parser.

Once you have the parser instance, you can use the standard SAX methods to set the various handlers provided by SAX.
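For example, a document handler might look like the sketch below. It extends the standard SAX 1 convenience class org.xml.sax.HandlerBase and overrides two callbacks. The class name ElementCounter and its accessor methods are hypothetical names used only for this illustration.

```java
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;

// Hypothetical SAX 1 handler for illustration; not part of XML4J.
public class ElementCounter extends HandlerBase {
    private int elements = 0;
    private final StringBuffer text = new StringBuffer();

    // Called by the parser at the start of every element.
    @Override
    public void startElement(String name, AttributeList attributes) {
        elements++;
    }

    // Called by the parser for each chunk of character data.
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    public int getElementCount() {
        return elements;
    }

    public String getText() {
        return text.toString();
    }
}
```

You would register an instance with parser.setDocumentHandler(...) before calling parser.parse(...), and read the counts afterwards.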

How do I create a parser compatible with XML4J version 1?

It is our intent to deprecate a significant portion of the API from XML4J version 1 as we develop enhanced functionality in version 2. As an aid to developers who are currently using XML4J version 1, classes in the com.ibm.xml.parser and com.ibm.xml.xpointer packages are provided for backward compatibility. If you need parser functionality that is provided in version 1 and there is no corresponding functionality in version 2, you can use these "TX compatibility" classes in the same way that you used them in version 1. However, you cannot mix and match the native classes and the TX compatibility classes. You should use the version 1 method for creating a parser class and causing the parser to read its input, as well as for setting all options. The DOM returned by the compatibility classes will be an instance of the TX* classes from version 1.

Not all the functions which are available on com.ibm.xml.parser.Parser are supported or implemented:

    Not supported: Calling these methods will throw java.lang.IllegalArgumentException.

    • addNoRequiredAttributeHandler
    • getReaderBufferSize
    • setErrorNoByteMark
    • setReaderBufferSize

    Not implemented: These methods are present but should not be expected to function the same as in the old parser.

    • setProcessExternalDTD
    • setWarningNoDoctypeDecl
    • setWarningNoXMLDecl
    • setWarningRedefinedEntity
    • stop

Version 1 occasionally inserted extra TX nodes in its DOM tree. Even though the compatibility classes provide a TX DOM tree, these extra nodes will not be present. If your application relies on the presence of these extra nodes, you will need to modify your code.

Users who are moving to the new parser architecture but want to use the catalog file format supported by the old parser should use the com.ibm.xml.internal.TXCatalog class. See the question "How do I use catalogs?".

What new options are available on parsers?

  • setAllowJavaEncodingname() - If set to true, allow Java's names for encodings to be used as well as the names defined by the XML standard.
  • setWarningOnDuplicateAttDef() - If set to true, warn if there are duplicate attribute definitions.
  • setCheckNamespace() - If set to true, perform syntactic checking of namespaces when they are present.
  • setContinueAfterFatalError() - If set to true, keep processing, even if a fatal error occurs.
  • setDocumentTypeHandler() - set the XMLDocumenthandler
  • setEntityHandler() - set the EntityHandler
  • setValidationHandler() - set the ValidationHandler
  • setDocumentHandler() - set the SAX DocumentHandler
  • setLocale() - set the locale to use for messages
  • setEntityResolver() - set the SAX EntityResolver
  • setDTDHandler() - set the SAX DTDHandler
  • setErrorHandler() - set the SAX ErrorHandler

How do I use namespaces?

In XML4J version 2, there are two methods of namespace support. The first method is to use the TX compatibility classes, which provide an API for dealing with namespace information.

The second method is to use the native API support. In this release, you can call the setCheckNamespace() method on one of the parser configuration classes (e.g. com.ibm.xml.parsers.ValidatingSAXParser), and XML4J version 2 will verify that all namespace attributes have the correct syntactic form.

How do I use catalogs?

XML4J Version 2 supports two catalog file formats: the SGML Open catalog that was supported in version 1, and the proposed XCatalog specification.

To use the original catalog file format, set a TXCatalog instance as the parser's EntityResolver. For example:

  XMLParser parser  = new DOMParser();
  Catalog   catalog = new TXCatalog(parser.getParserState());
  parser.getEntityHandler().setEntityResolver(catalog);

Once the catalog is installed, catalog files that conform to the TXCatalog format can be appended to the catalog by calling the loadCatalog method on the parser or the catalog instance. The following example loads the contents of two catalog files:

  parser.loadCatalog(new InputSource("catalogs/cat1.xml"));
  parser.loadCatalog(new InputSource("http://host/catalogs/cat2.xml"));

To use the XCatalog catalog, you must first have a catalog in XCatalog format. The current version of the XCatalog catalog supports the XCatalog proposal draft 0.2 posted to the xml-dev mailing list by John Cowan. XCatalog is an XML representation of the SGML Open TR9401:1997 catalog format. The current proposal supports public identifier maps, system identifier aliases, and public identifier prefix delegates. Refer to the XCatalog DTD for the full specification of this catalog format at http://www.ccil.org/~cowan/XML/XCatalog.html.

In order to use XCatalogs, you must write the catalog files with the following restrictions:

  • You must follow the XCatalog grammar.
  • You must specify the <!DOCTYPE> line with the public identifier "-//DTD XCatalog//EN", or make sure that the system identifier is able to locate the XCatalog 0.2 DTD (which is included in the JAR file containing the com.ibm.xml.internal.XCatalog class). For example:
      <!DOCTYPE XCatalog PUBLIC "-//DTD XCatalog//EN" "com/ibm/xml/internal/xcatalog.dtd">
      
  • The enclosing <XCatalog> document root element is not optional -- it must be specified.
  • The Version attribute of the <XCatalog> has been modified from '#FIXED "1.0"' to '(0.1|0.2) "0.2"'.

To use this catalog in a parser, set an XCatalog instance as the parser's EntityResolver. For example:

  XMLParser parser  = new SAXParser();
  Catalog   catalog = new XCatalog(parser.getParserState());
  parser.getEntityHandler().setEntityResolver(catalog);

Once installed, catalog files that conform to the XCatalog grammar can be appended to the catalog by calling the loadCatalog method on the parser or the catalog instance. The following example loads the contents of two catalog files:

  parser.loadCatalog(new InputSource("catalogs/cat1.xml"));
  parser.loadCatalog(new InputSource("http://host/catalogs/cat2.xml"));

Limitations: The following are the current limitations of this XCatalog implementation:

  • No error checking is done to avoid circular Delegate or Extend references. Do not specify a combination of catalog files that reference each other.
 

How do I use the revalidation API?

In this release of XML4J version 2, you can validate a document after it has been parsed and converted to a DOM tree. To do this, use the RevalidatingDOMParser or the TXRevalidatingDOMParser class. The validate method on these classes takes a DOM node as an argument and performs a validity check on the DOM tree rooted at that node, using the DTD of the current document. Currently, the native DOM prevents the insertion of invalid nodes, so this feature is less useful for the native DOM.

This is an experimental feature, and the details of its operation will change in future releases of XML4J version 2. We are including it in order to hear your feedback on the functionality of these APIs.

The sample program below parses a document, inserts an illegal node into the TX DOM and then tries to re-validate the document.

import java.io.IOException;
import com.ibm.xml.parser.TXElement;
import com.ibm.xml.parsers.TXRevalidatingDOMParser;
import org.xml.sax.SAXException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class RevalidateSample {
    public static void main(String args[]) {
        String xmlFile = "file:///Work/xml4j2/data/personal.xml";
        TXRevalidatingDOMParser parser = new TXRevalidatingDOMParser();
        try {
            parser.parse(xmlFile);          
        } catch (SAXException se) {
            System.out.println("SAX error while parsing: caught "+se.getMessage());
            se.printStackTrace();
        } catch (IOException ioe) {
            System.out.println("I/O Error while parsing: caught "+ioe);
            ioe.printStackTrace();
        }

        Document doc = parser.getDocument();
        
        System.out.println("Doing initial validation");
        Node position = parser.validate(doc.getDocumentElement());
        if (position == null) {
            System.out.println("ok.");
        } else {
            System.out.println("Invalid at " + position);
            System.out.println(position.getNodeName());
        }

        // Now insert dirty data
        Node junk = new TXElement("bar");
        Node corruptee = doc.getDocumentElement();
        System.out.println("Corrupting: "+corruptee.getNodeName());
        corruptee.insertBefore(junk,corruptee.getFirstChild().getNextSibling());
        
        System.out.println("Doing post-corruption validation");
        position = parser.validate(doc.getDocumentElement());
        if (position == null) {
            System.out.println("ok.");
        } else {
            System.out.println("Invalid at " + position);
            System.out.println(position.getNodeName());
        }

    }
}

How do I handle errors?

When you create a parser instance, the default error handler does nothing. This means that your program will fail silently when it encounters an error. You should register an error handler with the parser by supplying a class that implements the org.xml.sax.ErrorHandler interface. This is true regardless of whether your parser is DOM-based or SAX-based.
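A minimal sketch of such a handler follows. It uses only the standard org.xml.sax.ErrorHandler interface. The class name ReportingErrorHandler, the message format, and the choice to rethrow fatal errors are assumptions made for this example, not behavior prescribed by XML4J.

```java
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical error handler for illustration; not part of XML4J.
public class ReportingErrorHandler implements ErrorHandler {
    private int warnings = 0;
    private int errors = 0;

    // Builds a one-line message that includes the error position.
    public static String format(String level, SAXParseException e) {
        return level + " at line " + e.getLineNumber()
                + ", column " + e.getColumnNumber() + ": " + e.getMessage();
    }

    // Warnings are reported and counted; parsing continues.
    public void warning(SAXParseException e) {
        warnings++;
        System.err.println(format("Warning", e));
    }

    // Recoverable errors (such as validity errors) are reported and counted.
    public void error(SAXParseException e) {
        errors++;
        System.err.println(format("Error", e));
    }

    // Fatal errors are reported and rethrown so that parse() fails visibly.
    public void fatalError(SAXParseException e) throws SAXException {
        System.err.println(format("Fatal error", e));
        throw e;
    }

    public int getWarningCount() {
        return warnings;
    }

    public int getErrorCount() {
        return errors;
    }
}
```

Register an instance with parser.setErrorHandler(new ReportingErrorHandler()) before calling parser.parse(...).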
