SAX2 XML Utilities (May 2000)

David Brownell


Table of Contents

About this Software
What do I Download?
Updates, Bug Fixes, and All That
Licensing
Documentation with Docbook
API Reference Documentation in Javadoc
Feature Overview
SAX2 Parsers
About Æ2 (Ælfred2, or Ælfred Enhanced)
Commandline Support
XML Pipelines
XML/XHTML Checking Service
Support for DOM (L1 and L2)
Examples and Documentation
SAX2 Transition
SAX1 v. SAX2
Migrating SAX1 configurations to SAX2
Converting to Namespace Based Processing
Doing Developer Stuff
Other Parsers; DOM; XSLT; Servlet Engines...
Using Classfile Binaries
Using Native Binaries: GCJ
Rebuilding From the Sources
Change History

About this Software

This SAX2 XML Utilities package works with Java-enabled systems supporting JDK 1.1 or later. It includes a variety of SAX1 and SAX2 parsers, including Ælfred Enhanced (Æ2); command-line support for validation and other tasks, including exchanging XML documents with remote servers; facilities to process and print XML as part of XML Pipelines; related support for DOM (Levels 1 and 2); and basic examples and documentation. See the feature overview in this document for more detailed information.

Because it is open source software, you can take it and modify it as you please, subject only to the license which applies. It uses standard SAX and DOM programming interfaces, so you can use other implementations of those APIs if you choose ... with alternatives including top quality open source software as well as corporate offerings from IBM, Oracle, Sun, and Excelon (the company formerly known as ODI).

The package includes source, full javadoc, and a JAR file with all classes prebuilt. A separate DOM2 package is available with a DOM implementation, source, and documentation.

What do I Download?

If you don't need source or (lots of) javadoc now, all you need to do is put a JAR file in your class path. There are instructions for a shell environment, including a link to download the file.

That's enough to give you a command line validation tool, and to be able to configure a servlet environment to provide you an XML/XHTML checker service that will let you validate documents and resolve the error messages at your leisure.

If you want the full distribution, with source and javadoc, see the distribution point noted in the next section.

Updates, Bug Fixes, and All That

Other versions, or updated status, may be accessible through the current distribution point: http://home.pacbell.net/david-b/xml/ (which also has information about some other XML work I've done). Please report problems to me by email, at <david-b@pacbell.net>. If you're interested in (read-only, for now) CVS access, let me know.

Thanks to all those who have sent patches, feedback, or appreciation. And of course, to David Megginson, without whom this package would not exist!

Licensing

This is distributed "as-is", with no warranty of any kind.

The source code in this package is subject to the enclosed license, which is a variant of the "Q Public License 1.0".

Basically, it's an Open Source license which has the following restriction on any modifications you perform: if you change the interface to a Java package in this distribution, you're not allowed to keep using the original package name. Only the owner of a given package defines its interfaces; other interface changes call for a change to the package name.

Documentation with Docbook

This document is written in XML using the XML version of the DocBook 4.0 beta2 DTD (linked from Norm Walsh's site as well as the OASIS DocBook site) and formatted using the XSLT HTML stylesheet available at the same site.

Other DocBook tools exist, especially in the SGML world. Still, some of you may find this document to be a useful example of using XML in a powerful document publishing context. The whole system (DocBook, XSL, XML, Java) is a good example of what XML can do for you today.

There's an O'Reilly Duck Book out. It's a good reference, not a tutorial. The bit in section 2.5, where task-oriented categories of markup are summarized, is essential. This book reminds me of why I like browsing hardcopy sometimes, though you can also do it online at the website. Tools to make learning and using DocBook easy are now in place, using only 100% Pure Java.

API Reference Documentation in Javadoc

Full javadoc is included with this distribution. It supports SAX1, SAX2, the parsers, the pipeline support, and utility classes. Note that at this writing, javadoc output is not valid HTML, and also can't be parsed as XML.

Feature Overview

This package includes a good SAX 1.0 XML parser, and it's open source; but many developers have been using it primarily as the most complete implementation of the SAX2 APIs available. Developers using Sun or Oracle parsers (or DOMs) have been able to work with most or many features of SAX2, since its earliest official alpha release, without needing to wait for vendor updates or even to switch parser cores.

Some useful features are available to non-development staff through command line invocation. In particular, it's easy to collect the errors from parsing (or validating) a document, and thus to have shell-script style automation of certain XML-based processes (publication, document exchange, and so on).

Many web based applications need to be able to do more. For example, this package includes a SAX2 wrapper for Swing's HTML parser that reports XHTML output events. (Of course, the standard XML parser will parse real XHTML. In order to work with malformed documents you need an HTML parser ... and possibly a better one than this.) It also supports pipelines that can write widely interoperable XHTML output text, and a simple namespace-aware filter which can pull the links out of a web page.

This package includes simple XML/XHTML validation and link checking tools, built using JSPs for their ease of development. These could help you get a start providing XML and XHTML based web services.

In processing pipelines, a DOM can be used both to produce and to consume streams of SAX events. This package includes the code to use a DOM in such a way, and uses whatever DOM it's given.

SAX2 Parsers

The parsers available through this package all support the SAX2 APIs. They support important optional SAX2 functionality as follows:

Parser | Validation | Handlers | Notes
Ælfred Enhanced (Æ2) | no | DeclHandler, LexicalHandler (except entity refs) | Default parser; highly conformant; the only parser here with SAX1 compatibility
Æ2 Validating | yes | DeclHandler, LexicalHandler | Highly conformant; might be removed or merged
DOM tree "parser" | no | LexicalHandler (comments, cdata, and "safe" entity refs only) | Initialize by setting the dom-node property; parameters to Parser.parse get ignored. Note: must currently use XML mode, not the SAX2 default, with an L1 DOM.
Swing HTML parser wrapper | no | LexicalHandler (comments only) | Parses HTML and reports it as being in the XHTML namespace
Sun TR2 driver | optional | DeclHandler, LexicalHandler | Highly conformant and functional; currently uses XML mode for namespaces (not the SAX2 default)
Oracle XML parser (v2) wrapper | optional | no extended handlers | Adapts this parser to SAX2, providing access to the existing validation flag

In the notes above, XML mode refers to a parser with the "namespace-prefixes" feature set (to true), and the "namespaces" feature cleared (set to false) - acting just like a SAX1 parser, except for the use of the SAX2 ContentHandler instead of the SAX1 DocumentHandler.
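
For the curious, here's a minimal sketch (not code from this package) of putting a SAX2 parser into XML mode using the two standard SAX2 feature names; the no-argument factory call picks up whatever default parser is configured:

import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class XmlModeDemo {
    public static void main(String[] args) throws Exception {
        // uses the org.xml.sax.driver property, else the compiled-in default
        XMLReader parser = XMLReaderFactory.createXMLReader();

        // keep qualified (prefixed) names and xmlns attributes visible ...
        parser.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
        // ... and skip namespace processing, as a SAX1 parser would
        parser.setFeature("http://xml.org/sax/features/namespaces", false);

        parser.parse(args[0]);
    }
}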

See the SAX2 migration notes on selective validation for important information on the use of the parser features controlling validation.

About Æ2 (Ælfred2, or Ælfred Enhanced)

The Ælfred parser has been around for some time; it was originally written by David Megginson for Microstar, and was not visibly maintained until I decided I wanted another Open Source parser as conformant as XP ... and to have top quality support for all the SAX2 APIs.

The first thing to do was to make it more conformant to the XML specification. Read the Ælfred2 package overview for more information about this parser, what changed, and how conformant it really is. (It's top ranked, but there are some issues that it's not currently worrying about.)

Later steps involved using new SAX2 handlers to provide a cleanly layered implementation of validation over those APIs. To learn about the limitations of that validation, read the validator consumer pipeline stage javadoc. Basically, there are some tokenization level issues which XML has elevated to validity constraints (rather than being well-formedness issues) and those aren't tested. (Still top ranked, though.)

At this writing, there's a separate validating parser. I may well remove that, since it isn't strictly needed given the pipeline framework and full reporting of SAX2 declaration and lexical events.

At one point I ran a profiling tool on this software, and found some things to speed up. Raw parsing speed on a 55 MByte document (big!) was about 20% faster with this parser than with then-current releases from Sun or Xerces. With JDK 1.3rc2, on Win32, they were all about the same speed on that test; it may be that the Hotspot mixed-mode JIT has finally optimized things which made performance under previous JITs inconsistent.

The new name? It's doubly left recursive; isn't that enough?

Commandline Support

As you edit an XML document you may need to validate it so that you can fix any errors that will affect its processing (rendering, playback, etc). And if you're like me, you want to do that from the command line ... where you have easy access to bulk validation, by crafting shell scripts. The information given here works with a Bourne shell, BASH or KSH. If you're using some other environment, such as command.com, you may be able to translate these instructions.

Note that this software works fine with Microsoft's jview, if you don't want to download a better version of Java. If you want to use the HTML parser you'll have to get the JDK 1.1 version of Swing, or (best) upgrade to JDK 1.2.

Setup

Make sure that you have a recent enough version of the JDK by typing the command java -version to see what it tells you. If there is such a command, but it gives a version older than "JDK 1.1.7", upgrade. If there's no such command, install a current version! You can get one at Sun's website, or perhaps from your operating system vendor. JDK 1.2.2 is a fine choice if it's available for your platform, though Linux developers will likely prefer the IBM or Blackdown ports since they're the ones that have JIT compilers. (The versions of Microsoft's JDK 1.1-ish jview that I've tried have worked OK too, though I've not used them as often as ports of Sun's implementation. Command options and setup are different for jview.)

Download the utilities.jar file to someplace you can get to it easily. Keep a copy that you won't overwrite (if you compile this software), perhaps someplace like a "lib" subdirectory in your home directory. The example here assumes you stored it in that location.

You need to make sure Java will find that JAR file when executing the parser commands. Put it in your CLASSPATH, something like this (assuming you already have one set):

$ CLASSPATH=$HOME/lib/utilities.jar:$CLASSPATH
$ export CLASSPATH

(Alternatively, pass this explicitly on the command line each time you invoke an XML parser. Java 2 supports command line options like "-cp $HOME/lib/utilities.jar" to make this simple, but JDK 1.1 is less flexible.)

Set up some shorthand way to invoke these tools. Shell scripts work, as do shell aliases. Another classic technique is just to use a variable. Using Bourne shell (KSH, BASH, etc) syntax to define shell functions, perhaps in a .profile file, define functions such as these. (Or you can just type these into the shell command line.) Note that the spaces are important; don't scrunch things up against the parentheses. Also, if you want to switch default parser or DOM implementations, pass the appropriate "-D" parameters as described later.

# usage: parse filename-or-url
# parses with default parser, reports errors
parse ()
{
    java org.brownell.xml.DoParse $1 null
}

# usage: validate filename-or-url
# validates with default parser, reports errors including validity errors
validate ()
{
    java org.brownell.xml.DoParse $1 "nsfix | validate"
}

# usage: xecho filename-or-url
# parses with default parser, echoes output
xecho ()
{
    java org.brownell.xml.DoParse $1 "nsfix | write ( stdout )"
}

That's it for setup. Some later release may well make the JAR file executable so it's a bit simpler. To understand what those are doing, read the documentation for the DoParse class, and read the section on XML Pipelines later in this document.

Example Command Lines

Assuming you set up the three shell functions as shown above, you can invoke them as shown here. For all commands, the standard error output collects messages about warnings, errors, validity errors, and fatal errors. I'll use the example of a public domain electronic text fuman10.xml [The Insidious Dr. Fu Manchu by Sax Rohmer], in part because of the author's pseudonym. (Plus I believe in pulp fiction without tree-killing, and in supporting projects such as the HTML Writers Guild Project Gutenberg.)

To parse a local file, and print messages about any well-formedness errors found in it:

$ parse fuman10.xml

To validate a document, this time right off the web, and print messages about any validity (or other) errors:

$ validate http://www.hwg.org/opcenter/gutenberg/xmlfiles/fuman10.xml

To parse a file, and echo it (in somewhat pretty printed format) to the command standard output:

$ xecho fuman10.xml

You may wish to validate files that claim to be XHTML conformant, to see if they really are conformant. If you're creating XHTML content yourself, that's an extremely good idea. Sometimes you'll find documents with fatal well-formedness errors in the first few lines! You really don't want your documents to cause problems like those.

XML Pipelines

You may have noticed that in the shell functions given above, the parameter after the file (or URL) input was a string. Those strings describe simple ways to catenate different processing components, each of which is a separate class. The strings can identify those classes by name, and some simple components are available with the more convenient names used in the shell functions above.

The full syntax of those strings, and descriptions of a number of composable pipeline stages, is documented with the PipelineFactory class. The power of the framework comes from integrating new kinds of pipeline components, which won't necessarily be available through the string syntax. What's now available is primarily infrastructure, supporting in-process and remote connectivity.

The components used in the shell functions above are only a few of those available. Here is a bit more information about what those components do, and what some other interesting components do:

Pipeline Stage | Description
nsfix | ensures XML names are correctly prefixed and that prefixes are properly declared
null | does nothing; lets errors be reported
server | sends input to a remote server (specified by an HTTP/HTTPS URL); output is as provided by that server
tee | sends events to another pipeline before continuing processing with this one
validate | requires correctly prefixed names and a full event stream; tests all significant validity constraints
write | writes "pretty" printed XML or XHTML to files or stdout/stderr; requires namespace prefixes to be correct

The notion of pipeline stages as components, and of connecting such pipelines in streams, can be a useful way to structure your code for low-overhead processing.

This framework also solves what can be a basic problem for namespace-aware XML programs with both DOM Level 2 and with SAX2: it lets you write well formed XML text with the right XML namespace prefixes and declarations. You can even validate data on the way out, to prevent certain classes of errors from being created, or from spreading.
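
If you'd rather drive a pipeline from Java than from a shell function, one low-effort approach is to reuse the DoParse entry point shown earlier. The pipeline string below is only an illustration composed from stages in the table above; check the PipelineFactory javadoc for the authoritative syntax:

import org.brownell.xml.DoParse;

public class PipelineDemo {
    public static void main(String[] args) throws Exception {
        // fix up prefixes, validate, then pretty print to standard output
        DoParse.main(new String[] {
            "fuman10.xml",
            "nsfix | validate | write ( stdout )"
        });
    }
}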

XML/XHTML Checking Service

This distribution includes a tool that I've found useful when working with XML/XHTML sections of a web site: a simple web-based XML/XHTML checking service. It requires that you have a relatively recent servlet/JSP engine available.

This checker recognizes the XHTML DTDs and will present the W3C "XHTML 1.0" logo if the web page validates using an XHTML DTD. (Assuming the underlying parser reports the startDTD callback with the PUBLIC identifiers; two of the parsers in this package will do so.) It's even set up to use local copies of those DTDs, so validation won't slow down because of delays to download 64K of DTDs on each request.

In fact, since this supports a "check referrer" facility, you can put that logo on your webpage as a clickable XHTML icon. Run /checker.jsp on your server, copy the XHTML icon into the appropriate place, and put some XML like this into your XHTML web pages (providing the right URI for the checker -- if it's not in the document root) so that you can easily revalidate them after changes:

<p><a href="/checker.jsp?REFER=yes">
<img src="/images/vxhtml10.png" alt="valid xhtml 1.0"
    align="middle" border="0"
    width="88" height="31"
    /></a></p>

This particular XHTML checker service doesn't have the limitations of the current experimental W3C XHTML checking service - it will report XML well-formedness errors that the W3C service can't (at this writing) detect.

A few more things of interest: there's also a link checking facility available (reports any broken links), and some basic firewall-aware request filtering (with one obvious hole that'll eventually be fixed).

Setup

Download the examples/checker.war file and install it in any server which understands what to do with Java "Web Applications". Instructions will differ between hosting environments for now. The WAR file is basically an image of a directory that should get installed onto your web server. It may get automatically expanded in some environments; in others you need to do it yourself. At this writing the images and style sheets will not work correctly unless they are available from the document root of the web server. There will usually be a slight delay the first time you access a JSP (after installation or modification) until it is compiled. Then you'll access it at the same speed you access any other servlet.

The Tomcat server (see the later section where other software packages are listed) runs fine on Win32; both shell and COMMAND.COM startup scripts are provided. In Tomcat, you configure this in the server.xml configuration file. As part of the Java Apache project, this will become smoothly integrated with other Apache servers.

For the moment, this WAR file does not include the library classes it relies on. That can easily be done (WEB-INF/lib/utilities.jar) but for now, you must ensure that your server provides them to you as part of your server configuration and operation.

Server Operation for Tomcat

There's a tomcat.sh file in the top level of the Tomcat distribution. It works sort of like an /etc/rc.d script, taking commands like start and stop. However, it's unlike a real daemon startup script in some ways which get in the way of a "hands off" setup:

Environment Setup

Important operational parameters, such as CLASSPATH and the choice of Java environment, come from the command line environment rather than being controlled only through the startup script. That's a classic source of security problems, which may matter if you want to run this securely or on the standard HTTP server port.

On the other hand, if you set up CLASSPATH as described earlier then you won't have to do anything more.

But there's another drawback: because of the way CLASSPATH is set up, you can't use a level 2 DOM implementation. This is because it puts an internal copy of W3C's Level 1 DOM interfaces in front of any more up-to-date Level 2 copy. You'll need to modify the tomcat.sh driver itself to enable use of an L2 DOM, and also force use of a JDK 1.2 JVM.

Output

Tomcat doesn't tie into system logging yet. This means that diagnostics are mostly going to be System.out and System.err messages.

Those messages come directly as output of the script, which (appropriately) jumps into the background. They aren't automatically saved anywhere like "real" server diagnostics. You'd either need to save to a file each time you started the server, or modify the script to save diagnostic output someplace appropriate.

Once you set it up, you should be able to start and stop Tomcat with the appropriate script for your system. It's been pretty robust for me, but then I've not been putting much stress on it either. For what I'm doing just now, and with a JVM with a JIT, it's using at most a 20 MByte memory footprint. That's fine for any system I'm not using, and even for most others - it lets me offload validation to an idle machine!

Support for DOM (L1 and L2)

Although this distribution does not include DOM, a sibling "DOM2" implementation is available. This implements DOM Level 1 and the current version of DOM Level 2, currently in its first Candidate Recommendation (CR) release. It supports such functionality as mutation events and basic traversals. You may, if you choose, use any other DOM implementation.

The support in this package for DOM includes generic "DOM bootstrapping" APIs, independent of vendor-specific DOM APIs, for essential facilities like parsing an XML document into a DOM tree (note the functionality restrictions imposed by missing DOM, and to a lesser degree SAX, APIs), getting a default DOM implementation, and generating SAX2 events from a DOM tree.

It also includes support to use DOM trees in processing pipelines, both as a producer and as a consumer (or buffer/filter) component.
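
As a hedged illustration of the producer side, the DOM tree "parser" listed in the table earlier is initialized through the standard SAX2 dom-node property and then replays the tree as SAX2 events. The driver class name below is a placeholder, not the real one; check this package's javadoc for the actual class:

import org.w3c.dom.Document;
import org.xml.sax.ContentHandler;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class DomReplay {
    // replays an in-memory DOM document as a stream of SAX2 events
    public static void replay(Document dom, ContentHandler out) throws Exception {
        // placeholder class name -- see the javadoc for the DOM tree driver
        XMLReader driver = XMLReaderFactory.createXMLReader(
            "org.brownell.xml.DomDriver");
        driver.setProperty("http://xml.org/sax/properties/dom-node", dom);
        driver.setContentHandler(out);
        // the argument to parse() is ignored by this driver
        driver.parse("ignored");
    }
}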

Examples and Documentation

If you want to learn about XML and Java, there are several parts of this package you should look at. Of course, programmers will be interested in the javadoc for the APIs, but you don't need to write code to make use of this package. The document you're now reading is the rest of the basic written documentation.

First, see the about this document section, and the related part of the Makefile, to see how a DocBook-based XML publication system works. This package is used for the parser it contains, with the formatting work being done by an XSLT engine (XT) as driven by a large stylesheet.

Then, look at the "examples" directory. This is mostly targeted towards programmers, but if you can read a Makefile you can learn something about how some of the command line tools can work - for example, to validate a document. Yes, you can validate even with a non-validating parser ... if it provides full support for SAX2 callbacks. Type "make" (with no arguments) in that directory to get a short summary of what's currently available. Of particular note may be:

  • The "echo" and "rss" examples (make echo; make rss) which show how SAX2 command line pipeline invocations (in the Makefile) can parse, echo, and even validate an input document.

  • The "domparse" example (make compile; make domparse) shows programmers how to set up three simple pipelines: text to DOM, DOM to text, and text to text. These can be invoked from the command line; this is what it looks to do it directly. See what information DOM discards. (This requires a DOM implementation, ideally the sibling "DOM2" package else you must modify the configuration as described later.)

  • The "nsprint" example (make compile; make nsprint) may be of particular interest to programmers, since it shows how DOM L2 (if you have it) and SAX2 expose namespace information differently for a given file.

  • The "show" example (make compile; make show) which shows the standard SAX2 properties and feature supported by each of the six SAX2 parsers in this package. (And any others you ask it to look at!)

  • The "xhcheck" example (make compile; make xhcheck) shows a complex pipeline, using the Swing HTML parser to read a document and validate its output against a DTD that has been loaded into memory. (Of particular interest if you want an output pipeline that validates data according to some DTD/Schema as it goes by.)

Look at the XML/XHTML checking software; although JSPs (a mix of HTML and Java code) win no prizes in my book for maintainability, they do provide a quick turnaround development cycle, and the checker may help you provide more correct XHTML. These are all in examples/checker.war, which is a JAR/ZIP archive in a format that a web application server should be able to import directly. Set it up if you like; run it, see what's inside.

SAX2 Transition

As presented earlier, this package includes a variety of SAX2 parsers. If you are new to SAX2, but have experience with SAX1, you may want a brief overview of the road ahead.

SAX1 v. SAX2

While the SAX2alpha parser interface was directly compatible with the SAX1 API, SAX2 now needs an adapter. Moreover, several core SAX1 interfaces have been deprecated, so you're being strongly advised to completely forget about those interfaces internally in your code ... and to provide new APIs for all of your modules that expose any of those SAX1 interfaces. (This is not a functional restriction, and no problems come from using the SAX1 APIs. Accordingly, feel free to ignore these compile-time deprecation warnings in any code that does not require new SAX2 features, and which isn't part of an API you are creating, exporting, and supporting.) Instead, use the new interfaces:

SAX1/SAX2alpha | SAX2 | Why the Change?
AttributeList | Attributes | Namespace info became easily available.
DocumentHandler | ContentHandler | You are given element and attribute namespace info as well as prefix mapping scopes, and know about skipped entities.
HandlerBase | DefaultHandler | Supports ContentHandler methods (but not LexicalHandler or DeclHandler).
Parser | XMLReader | Name change needed due to namespace and other changes. The new class incorporates what SAX2alpha had in the Configurable interface to get, set, and clear feature flags (and the values of named properties), and uses ContentHandler.
ParserFactory | XMLReaderFactory | Bootstrapping API; uses the new system property org.xml.sax.driver, which may identify either a SAX1 or SAX2 parser.
(none) | DeclHandler | New in SAX2.
(none) | LexicalHandler | New in SAX2.

Changes in the final release of SAX2 include using the standard term "qualified name" where previously the nonstandard term "raw name" was used, and use of JDK 1.2 javadoc features. Also, the NamespaceSupport class has new methods to expose the current prefixes, and the XMLReaderFactory can now fall back to using a SAX1 Parser to create a SAX2 XMLReader.

Element and attribute names are now expressed primarily in tuples of a namespace URI and a local name, rather than in terms of XML names. This is exposed through the element callback, and the other changes cascaded from there.

SAX2 parsers support an open list of named "feature" flags and "property" objects. The names are URIs. SAX2 defines a number of these identifiers; many of them changed before the final release. All new handlers are exposed as properties.

In the SAX2 release version, the declaration and lexical handlers are optional. Historically, this is strange, since the functionality in those interfaces was what got the very first SAX2 discussions going, in late 1998, and unlike the mandatory namespace functionality, these features can't be layered. This distribution strongly depends on them, so they are not optional so far as this distribution is concerned.
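
To make those renamings concrete, here is a minimal sketch (not taken from this package) of the SAX2 replacements for a typical SAX1 setup: XMLReaderFactory instead of ParserFactory, XMLReader instead of Parser, and a ContentHandler registered through setContentHandler:

import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class Sax2Skeleton {
    public static void main(String[] args) throws Exception {
        // reads the org.xml.sax.driver system property (SAX1 or SAX2 driver)
        XMLReader reader = XMLReaderFactory.createXMLReader();

        DefaultHandler handler = new DefaultHandler();
        reader.setContentHandler(handler);    // replaces setDocumentHandler()
        reader.setErrorHandler(handler);
        reader.setDTDHandler(handler);
        reader.setEntityResolver(handler);

        reader.parse(args[0]);
    }
}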

Migrating SAX1 configurations to SAX2

If you weren't relying on features of your current parser configuration beyond basic SAX1 parser capability (e.g., always validating, or else never validating) there's a relatively simple migration path available to you. If you were relying on such features, you have more work to do. In both cases, you may find this simple procedure to be a useful basis for your work.

  1. You should normally set the org.xml.sax.driver system property if you have a simple way to do this (such as setting it on the command line invocation used by many systems).

    Use the name of the SAX1 parser you are currently using; org.brownell.xml.aelfred2.Validator if you were always validating; or org.brownell.xml.aelfred2.SAXDriver, which is the compiled-in default used if you don't (or can't) provide such a property setting.

  2. Convert every current SAX1 Parser creation call. The best code currently uses ParserFactory, so it's easy to find these cases. Otherwise, you may need to search for Parser-typed variables, and change their assignments (perhaps they directly invoke the constructor for some specific parser).

    Change these to call the XMLReaderFactory.createXMLReader method. Unless you are selectively validating, or need to use more than one parser, you normally have no reason to pass a parser name to this function.

    Selective Validation

    SAX2 supports essentially three ways to handle selective validation.

    The simplest of them is to always do XML validation, and explicitly ignore the ErrorHandler.error callbacks except when you are validating. You'll probably pay on the order of 10% in terms of parse speed (and see extra heap pressure during parsing) to use this option. You'll also forgo reporting of the handful of nonfatal errors that aren't validity errors. Those costs may be fine.

    The second is common, and moreover is the only option supported by some parsers. It involves using a non-validating parser as your default, switching to a validating parser (perhaps org.brownell.xml.aelfred2.Validator) only when you really need to.

    The SAX2 pipeline package in this distribution can do this for you automatically. Provide a validation filter near the head of the filter chain, and a default parser which exposes enough of the lexical and DTD callbacks. (The package normally does this using Æ2, but you could use the wrapped Sun parser or the Xerces non-validating parser.)

    The third technique is not widely supported at this writing. It is to use a default parser that supports toggling the SAX2 parser feature flag controlling validation. Since that flag may be readonly, my advice is to use the second option above (preferably with the pipeline package), with the system default parser as a non-validating parser (such as Æ2).

    Using Multiple Parsers

    Most applications really only care about whether the document parses. But you may need to care about using particular parsers. The XML/XHTML checker service example lets users choose which parser to use, since some of its role is to support such comparisons of parser behaviors. More commonly, you may need to selectively use a parser capability that not all parsers support.

  3. At each of those spots you created a parser, modify the parser so that it sets the namespace-prefixes feature (to true) rather than using its default. This makes sure that the parser won't discard the XML name prefixes. It's harmless for code that doesn't depend on them, but this isn't the SAX2 default. (Perhaps because a "DOM Parser", or any similar producer of a SAX2 event stream which isn't an XML parser, may not be able to provide the data it implies.)

    Unless your current code is already doing namespace processing itself (using SAX1 APIs, not the SAX2 ones), you might even want to clear the namespaces parser feature (set its value to false). This tells the parser not to do namespace processing, which saves something like 10% in processing time.

  4. Convert all DocumentHandler implementations so they support the SAX2 ContentHandler API. It's simplest to just do a global substitution of the SAX2 interface name for the SAX1 one, though if you are exposing such classes as part of your own API you will need to continue implementing the SAX1 methods (and provide an adapter between AttributeList objects and the SAX2 Attributes ones of corresponding functionality). A sketch of such a converted handler appears after this list.

    Add implementations of the three new SAX2 methods, ContentHandler.skippedEntity, ContentHandler.startPrefixMapping, and ContentHandler.endPrefixMapping (just provide methods with an empty body).

    You'll have to change the syntax of all your existing startElement and endElement methods. Just add two new string parameters (the namespace URI, and the local name) to the beginning of the parameter declarations. Ignore the values, unless your current code is namespace-aware (using SAX1). You'll also need to change the type of the input parameter that wraps the attributes.

    If your current code uses namespaces, then don't change it to use the SAX namespace URIs yet ... but do compare the namespace values you compute (for elements and for attributes) with the values provided by the SAX2 parser, and log an error if they aren't the same. You may uncover bugs in your code through such comparisons; since such bugs will likely have percolated into the data you were manipulating, correcting them needs to be done with care. (Also, they need to be fixed quickly: that old code is going to keep spreading broken data as long as it's in use.)

    Unless you needed to keep DocumentHandler support (e.g. you exported it as one of your own interfaces, so your customers rely on it and you can't retract it), you should get no deprecation warnings (or any kind of errors) from building your software.

  5. Now test the software. Nothing should behave differently than it did before; if anything changed, find out why, and fix it.

    If everything works now just like it did before, congratulations! You've just converted your API to use SAX2.

  6. The last step is optional, unless you're already using namespaces: convert your code to actually use the SAX2 namespace support.

    If you were computing namespace information yourself, using the SAX1 data, switch over to use what the parser provides. As noted above, if your earlier namespace code had bugs, you may also need to correct the data it had previously accepted.
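
Here is the converted-handler sketch promised in step 4. The class and its output are purely illustrative; what matters is the DefaultHandler base class, the two extra name parameters on the element callbacks, the empty bodies for the new SAX2 methods, and the namespace-prefixes feature from step 3:

import org.xml.sax.Attributes;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class ConvertedHandler extends DefaultHandler {
    // SAX1 signature was: startElement(String name, AttributeList atts)
    public void startElement(String uri, String local, String qName,
            Attributes atts) {
        // ignore uri/local until this code becomes namespace-aware (step 6)
        System.out.println("start: " + qName);
    }

    // SAX1 signature was: endElement(String name)
    public void endElement(String uri, String local, String qName) {
        System.out.println("end: " + qName);
    }

    // new SAX2 callbacks; empty bodies are enough for ported code
    public void startPrefixMapping(String prefix, String nsURI) { }
    public void endPrefixMapping(String prefix) { }
    public void skippedEntity(String name) { }

    public static void main(String[] args) throws Exception {
        XMLReader reader = XMLReaderFactory.createXMLReader();
        // step 3: keep prefixed names available, as a SAX1 parser did
        reader.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
        reader.setContentHandler(new ConvertedHandler());
        reader.parse(args[0]);
    }
}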

Converting to Namespace Based Processing

After you've done a basic conversion, and verified that your code didn't break, you can start to take advantage of namespace functionality exposed in the new SAX2 APIs.

This involves changes in how you compare the names of elements and attributes: where you had previously compared just the XML 1.0 names, now you should instead compare the namespace URI and the local name (of the element or attribute) within that namespace. (Both must be identical.) Be careful of the so-called "default namespace" case, where SAX provides an empty namespace URI and local name; when you get that case, fall back to straight XML 1.0 handling.
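
As a small sketch of that comparison (the XHTML namespace URI here is just an example), note the fallback to the qualified name when no namespace URI was reported:

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class AnchorSpotter extends DefaultHandler {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    public void startElement(String uri, String local, String qName,
            Attributes atts) {
        boolean isAnchor;

        if (uri != null && uri.length() != 0)
            // namespace-aware match: URI and local name must both be identical
            isAnchor = XHTML.equals(uri) && "a".equals(local);
        else
            // no namespace reported: fall back to straight XML 1.0 naming
            isAnchor = "a".equals(qName);

        if (isAnchor)
            System.out.println("href = " + atts.getValue("href"));
    }
}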

One characteristic of namespace-based processing is that you may have new types of code in your application. Where previously you just compared names (and now compare more complex ones), you can also search for elements and attributes as found within a given namespace. This means that new modules can dispatch data or calls based on namespace. Depending on the policy used in such dispatching, the modules to which they dispatch may be able to assume they only see one namespace URI, which can substantially simplify processing.

If your application uses DOM instead of SAX, you have a simpler task. Just switch to a Level 2 DOM, then you won't have to change any call syntax until you want to take advantage of namespace functionality; there's no such abrupt transition. (Other tradeoffs often make me avoid DOM, though; notably, its implicit costs in terms of memory.)

Doing Developer Stuff

This should be pretty self-explanatory, and you have the source. However, it's not targeted at new-to-Java developers; sorry! (The closest you'll find is information about how to set up your command line environment to run simple tools, since your Java compiler takes roughly the same setup.) You're expected to know how to run "make" and to figure out what commented source code is doing.

Other Parsers; DOM; XSLT; Servlet Engines...

There are a lot of other parsers available; you should be able to use any of them, without much trouble. If the parser you want to use only offers SAX 1.0 API support, then it'll often be automatically wrapped in an adapter so that at least its basic parsing functionality is accessible.

To change the default parser or DOM used by this package, provide system properties. This is often done using "-Dproperty=value" syntax on the command line invocation of Java, where the value will be the name of a class. The system properties to care about are: org.xml.sax.parser (to change the default SAX1 parser), org.xml.sax.driver (to change the default SAX2 parser), org.brownell.xml.dom (to change the default DOM DOMImplementation class, which is useful only for DOM L2 implementations), and org.brownell.xml.DomBuilder.Document (to change the default DOM Document class, which is useful for both DOM L1 and DOM L2 implementations). (At some point the two different DOM-related properties clearly need to be rationalized!)
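
For example, here's a minimal sketch of selecting the default SAX2 parser from code rather than with a -D option (the driver named here is the package's compiled-in default anyway):

import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class PickParser {
    public static void main(String[] args) throws Exception {
        // System.setProperty needs JDK 1.2; with JDK 1.1, pass -D on the command line
        System.setProperty("org.xml.sax.driver",
                "org.brownell.xml.aelfred2.SAXDriver");

        XMLReader reader = XMLReaderFactory.createXMLReader();
        reader.parse(args[0]);
    }
}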

You may want to use a number of other tools with this package:

  • The sibling "DOM2" package, implementing DOM (Level 1, and Level 2 CR1). Install it in the .../dom2 directory. Note: this DOM implementation is the default for this package, and will be used if you don't assign any of the system properties mentioned above.

  • James Clark's XT XSLT engine. Install it in the .../xt directory.

  • The Tomcat Servlet Engine including support for Java Server Pages (JSPs, which provide a rapid turnaround servlet based development environment). I've been using release 3.1m1. Put this into a .../tomcat directory, and make sure your CLASSPATH environment variable is set when you start Tomcat. You won't be able to use an L2 DOM without using JDK 1.2 and changing the Tomcat CLASSPATH setup, however.

  • Sun's Project X TR2 package provides an extended SAX 1.0 parser and a Level 1 DOM implementation; these have been proven to be fast, conformant, and stable. Install it in the ".../xml-tr2" directory. This has been bundled with recent servlet developer packages as well as Tomcat releases, as "xml.jar".

  • Oracle's latest parser, which also includes SAX 1.0 and a Level 1 DOM implementation. Install that in the .../oracle2 directory. Note that this package includes an XSLT engine.

  • The Xerces/J parser; version 1.0.3 includes SAX2 support (in current versions) as well as some Level 2 DOM support. Install it in the .../xerces directory. From the same source there is yet another Java implementation of XSLT, called Xalan.

All three of those XSLT engines (XT, Oracle's, and Xalan) include some sort of "XSL Servlet", which does dynamic XSL formatting. You can use them with recent servlet engines, including Tomcat; see the documentation that comes with your XSLT engine.

Note that there are other XSLT engines, such as SAXON, in which you may also be interested.

Using Classfile Binaries

Just put the utilities.jar in the -classpath used when starting up your applications, or put it in your own CLASSPATH environment variable. This should work fine on JDK 1.1 (and on JDK 1.2 of course). You can then use these classes whenever you run or compile a Java program.

You may want to add the JAR files for any other XML processing packages you have. For the sibling DOM2 distribution, .../dom2/dom2.jar; for the XT XSLT engine, .../xt/xt.jar; to compile anything against servlets, .../tomcat/lib/servlet.jar; for Sun's TR2 parser, .../xml-tr2/xml.jar; for Oracle's v2 parser, .../oracle2/xmlparserv2.jar; and so on.

See the examples directory, as described elsewhere in this document.

Using Native Binaries: GCJ

This package avoids dependencies on Sun's JDK 1.2 (or later), in order to reach the large audience which will not generally have such versions installed. A benefit of that is that non-Sun JVM implementations can be used. I think the most interesting implementation is the GNU Compiler for Java (GCJ), which uses the well known GCC back end to compile Java into native code that's largely identical to what the corresponding C++ code would produce. See the include files that GCJH generates to get the idea.

Note that GCJ is still either "cutting edge" or "bleeding edge", depending on whether your glass is half full or half empty. At this writing, it's still hard to get running; you'll need recent snapshots, and it's not clear when the next GCC release (needed to fix some GCJ problems) will happen.

You can work with GCJ and generate native binaries. I've done my work with this on Linux/x86, but I understand that it also works on Win32 (using Cygwin). When I first measured, JDK 1.2.2 (with JIT) was faster on some code I measured, slower on other code.

You'll need to "make native" at the top level, and also in the "examples" subdirectory to see things work. And probably play around a lot ...

Rebuilding From the Sources

This is one of several bundles of software you need, in addition to having a copy of the JDK. (JDK 1.1 or later; JDK 1.2 or later is required to rebuild the javadoc.) You may need such a JDK even if you're interested in using native code, since GCJ can't currently compile these sources (due to some JDK 1.1-isms that the GCC Java front-end doesn't yet know about). All in all, you'll need:

  • This "utilities" package (see above for its distribution point). I have it in a directory named .../utilities.

  • Class files for the DOM L2 interfaces; I have these in a directory named .../dom2, along with the "dom2" implementation classes.

  • You'll need some version of the servlet.jar file to compile the XML servlet base class. Any version starting with 1.1 should do the job; I've been using the Apache Jakarta project as an open source web server with Servlet 2.2 API and "web application" support.

I run this from Win32 using the "Cygwin" shell environment (http://sourceware.cygnus.com/cygwin), and from Red Hat Linux 6.0. It works in both environments, with the same Makefile, using GNU make, and either JDK 1.1 or JDK 1.2 (though the javadoc won't work except on JDK 1.2).

You should be able to just type "make" in the directory for these utilities. With JDK 1.1, you may optionally set SWING_HOME to point to your version of the Swing distribution; otherwise, the Makefile will not build the HTML parser.

To compile with GCJ, first get it; see the official sourceware pages for both GCC (EGCS) and LIBGCJ downloads. I've done my testing on the Linux x86 platform, with current snapshots; expect results to vary. (The 1 December GCJ worked, but the 8 December snapshot wouldn't compile LIBGCJ. Such experiences are common.) Then, after having created the JAR file, type "make native". Look at the examples, too.

Change History

  • In May, the final SAX2 release was incorporated, and documentation was correspondingly updated. Fixes for some bugs, including two Æ2 bugs reported by Michael Kay, were incorporated. The documentation no longer mentions interim SAX2 drafts where that is not appropriate.

  • In February, many changes were released. These included converting the documentation to XML docbook, providing a JSP based XML/XHTML validation tool, highlighting command line facilities for bulk validation, SAX2 beta1 (big changes) and then beta2 (smaller) API conversions, more pipeline classes and components, DTD preload capability in the validator component, better XHTML support, better interoperability with DOM Level 1, and the usual level of minor bugfixes and cleanup. The bugfixes included a longstanding buffering bug in Ælfred.

  • January Y2K brought some performance work on the Æ2 parsers. On at least large files, this is now faster than "Original Ælfred" by about 20% in one test. It also included more XML conformance work as implied by some recently published errata to the XML spec. You can disable namespace support when building with a level 2 DOM implementation, if you want.

  • December 1999 saw the addition of a new pipeline component which does XML validation, and a SAX parser that combines it with Ælfred to create a validating parser. It caught up to the first "CR" of the DOM L2 specification, no longer creating DTD nodes (due to substantial API problems). The latest XHTML spec is supported. String interning in Ælfred became consistently global, versus largely local. The GNU Compiler for Java (GCJ) is supported; when you can get it to work, it's at least competitive with a good JIT (like IBM's).

  • In October various bugs were fixed (including feature/property/handler identifiers), and the enhanced Ælfred now supports most of the SAX2 "Lexical" handler, except reporting of external entity boundaries. The "pipeline" package made its appearance, as did some RSS examples. The Ælfred code was modified to match the One True Coding Style more closely. Also, the DOM builder was taught about using namespace support with DOM implementations which support the new Level 2 namespace feature.

  • In September, this package was updated to include the conformance-enhanced Ælfred (in the org.brownell.xml.aelfred2 package), so it no longer needed Sun's parser. It also bundled a driver for Oracle's parser, so there was another SAX2alpha option for validation. The EchoHandler was updated to emit XHTML, and the HTML parser became more robust.

  • The first release of this package was made 11 June, 1999 and was the first to include SAX2alpha support. It was updated later that month to include the DOM parser and builder, and some examples.