CXML/doc/using.html
dlichteblau dbb2732913 utf8-dom fixes.
recoding to utf-8 is now the default.
2005-12-27 01:35:13 +00:00


<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>Closure XML</title>
<link rel="stylesheet" type="text/css" href="cxml.css"/>
</head>
<body>
<div class="sidebar">
</div>
<h1>Using the SAX parser</h1>
<a name="parser"/>
<h3>Parsing and Validating</h3>
<p>
CXML is implemented as a SAX parser. (Refer to <a
href="dom.html#parser">make-dom-builder</a> for information about
DOM.)
</p>
<p>
<div class="def">Function CXML:PARSE-FILE (pathname handler &amp;key ...)</div>
<div class="def">Function CXML:PARSE-STREAM (stream handler &amp;key ...)</div>
<div class="def">Function CXML:PARSE-OCTETS (octets handler &amp;key ...)</div>
Parse an XML document.&nbsp;
Return values from this function depend on the SAX handler used.<br/>
Arguments:
</p>
<ul>
<li><tt>pathname</tt> -- a Common Lisp pathname</li>
<li><tt>stream</tt> -- a Common Lisp stream with element-type
<tt>(unsigned-byte 8)</tt></li>
<li><tt>octets</tt> -- an <tt>(unsigned-byte 8)</tt> array</li>
<li><tt>handler</tt> -- a SAX handler</li>
</ul>
<p>
Common keyword arguments:
</p>
<ul>
<li>
<tt>validate</tt> -- A boolean.&nbsp; Defaults to
<tt>nil</tt>. If true, parse in validating mode, i.e. assert that
the document contains a DOCTYPE declaration and conforms to the
DTD declared.
</li>
<li>
<tt>dtd</tt> -- unless <tt>nil</tt>, an extid instance
specifying the external subset to load. This option overrides
the extid specified in the document type declaration, if any.
See below for <tt>make-extid</tt>. This option is useful
for verification purposes together with the <tt>root</tt>
and <tt>disallow-internal-subset</tt> arguments.
</li>
<li><tt>root</tt> -- the expected root element
name, or <tt>nil</tt> (the default).
</li>
<li>
<tt>entity-resolver</tt> -- <tt>nil</tt> or a function of two
arguments which is invoked for every entity referenced by the
document with the entity's Public ID (a rod) and System ID (a
URI object) as arguments. The function may return <tt>nil</tt>,
in which case CXML will try to resolve the entity as usual.
Alternatively it may return a Common Lisp stream specialized on
<tt>(unsigned-byte 8)</tt> which will be used instead. (It may
also signal an error, of course, which can be useful to prohibit
parsed XML documents from including arbitrary files readable by
the parser.)
</li>
<li>
<tt>disallow-internal-subset</tt> -- a boolean. If true, signal
an error if the document contains an internal subset.
</li>
<li>
<tt>recode</tt> -- a boolean. (Ignored on Lisps with Unicode
support.) Recode rods to UTF-8 strings. Defaults to true.
Make sure to use <tt>utf8-dom:make-dom-builder</tt> if this
option is enabled and <tt>rune-dom:make-dom-builder</tt>
otherwise.
</li>
</ul>
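<p>
As a minimal sketch of the <tt>entity-resolver</tt> argument described
above, the following function refuses to load any external entity by
signalling an error from the resolver. The function name
<tt>parse-without-external-entities</tt> is hypothetical, and the
handler is assumed to be supplied by the caller:
</p>
<pre>(defun parse-without-external-entities (pathname handler)
  "Parse PATHNAME with HANDLER, rejecting every external entity."
  (cxml:parse-file pathname handler
                   :entity-resolver
                   (lambda (public-id system-id)
                     (declare (ignore public-id))
                     ;; Signalling an error here aborts the parse instead of
                     ;; letting the document pull in arbitrary files.
                     (error "Refusing to load external entity ~A" system-id))))</pre>
<p>
A resolver like this presumably rejects an external DTD subset as
well, so it is likely unsuitable for documents that are to be validated
against a DTD they name themselves.
</p>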
<p>
<div class="def">Function CXML:PARSE-DTD-FILE (pathname)</div>
<div class="def">Function CXML:PARSE-DTD-STREAM (stream)</div>
Parse <a
href="http://www.w3.org/TR/2000/REC-xml-20001006#NT-extSubset">declarations</a>
from a stand-alone file and return an object representing the DTD,
suitable as an argument to <tt>make-validator</tt>.
</p>
<ul>
<li><tt>pathname</tt> -- a Common Lisp pathname</li>
<li><tt>stream</tt> -- a Common Lisp stream with element-type
<tt>(unsigned-byte 8)</tt></li>
</ul>
<p>
<div class="def">Function CXML:MAKE-EXTID (publicid systemid)</div>
Create an object representing the External ID composed
of the specified Public ID, a rod or <tt>nil</tt>, and System ID
(a URI object).
</p>
<p>
<div class="def">Condition class CXML:XML-PARSE-ERROR ()</div>
Superclass of all conditions signalled by the CXML parser.
</p>
<p>
<div class="def">Condition class CXML:WELL-FORMEDNESS-VIOLATION (cxml:xml-parse-error)</div>
This condition is signalled for all well-formedness violations.
(Note that, when parsing a document that is not well-formed in validating
mode, the parser might encounter validity errors before detecting
well-formedness problems, so also be prepared for <tt>validity-error</tt>
in that situation.)
</p>
<p>
<div class="def">Condition class CXML:VALIDITY-ERROR (cxml:xml-parse-error)</div>
Reports the violation of a validity constraint.
</p>
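<p>
The condition classes above can be told apart with
<tt>handler-case</tt>. The following is only a sketch: the function
name <tt>try-parse</tt> and the keywords it returns are hypothetical,
and the handler is assumed to be supplied by the caller.
</p>
<pre>(defun try-parse (pathname handler)
  "Parse PATHNAME in validating mode.
Returns the parser's result, or NIL plus a keyword and the condition."
  (handler-case
      (cxml:parse-file pathname handler :validate t)
    ;; The two specific classes come first; both are subclasses
    ;; of cxml:xml-parse-error.
    (cxml:validity-error (c)
      (values nil :invalid c))
    (cxml:well-formedness-violation (c)
      (values nil :not-well-formed c))
    ;; Any other condition signalled by the parser.
    (cxml:xml-parse-error (c)
      (values nil :parse-error c))))</pre>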
<a name="serialization"/>
<h3>Serialization</h3>
<p>
Serialization is performed using <tt>sink</tt> objects. A sink
is an output stream for runes. There are different kinds of sinks
for output to Lisp streams, vectors, etc.
</p>
<p>
Technically, sinks are SAX handlers that write XML output for SAX
events sent to them. In practice, user code would normally not
generate those SAX events manually, and instead use a function
like <a href="dom.html#serialization">dom:map-document</a> or <a
href="xmls-compat.html">xmls-compat:map-node</a> to serialize an
in-memory document.
</p>
<p>
In addition to <tt>map-document</tt>, cxml has a set of
convenience macros for serialization (see below for
<tt>with-xml-output</tt>, <tt>with-element</tt>, etc).
</p>
<p>
<div class="def">Function CXML:MAKE-CHARACTER-STREAM-SINK (stream &amp;rest keys) => sink</div>
<div class="def">Function CXML:MAKE-OCTET-VECTOR-SINK (&amp;rest keys) => sink</div>
Return a sink suitable for event-based XML serialization.
</p>
<p>Keyword arguments:</p>
<ul>
<li>
<tt>canonical</tt> -- canonical form, one of NIL, T, 1, 2
</li>
<li>
<tt>indentation</tt> -- indentation level. An integer or <tt>nil</tt>.
</li>
</ul>
<p>
The following <tt>canonical</tt> values are allowed:
</p>
<ul>
<li>
<tt>t</tt> or <tt>1</tt>: <a
href="http://www.w3.org/TR/2001/REC-xml-c14n-20010315">Canonical
XML</a>
</li>
<li>
<tt>2</tt>: <a
href="http://dev.w3.org/cvsweb/~checkout~/2001/XML-Test-Suite/xmlconf/sun/cxml.html?content-type=text/html;%20charset=iso-8859-1">Second
Canonical Form</a>
</li>
<li>
<tt>NIL</tt>: Use a more readable non-canonical representation.
</li>
</ul>
<p>
With an <tt>indentation</tt> level, pretty-print the XML by
inserting additional whitespace.&nbsp; Note that indentation
changes the document model and should only be used if whitespace
does not matter to the application.
</p>
<p>
If namespace support is enabled (the default), these functions use
a namespace normalizer (<tt>cxml:make-namespace-normalizer</tt>).
</p>
<p>
<tt>unparse-document-to-octets</tt> returns an <tt>(unsigned-byte
8)</tt> array, whereas <tt>unparse-document</tt> writes
characters.&nbsp; <tt>unparse-document</tt> is useful together
with <tt>with-output-to-string</tt>.&nbsp; However, note that the
resulting document in both cases is UTF-8 encoded, so the
characters written by <tt>unparse-document</tt> are really UTF-8
bytes encoded as characters.
</p>
<p>
These functions provide the low-level mechanism used by the DOM
serialization functions. To serialize a document without building
its DOM tree first, create a sink and call SAX functions on that
sink. <tt>sax:end-document</tt> returns the serialized form of
the document described by the SAX events.
</p>
<p>
<div class="def">Macro CXML:WITH-XML-OUTPUT (sink &amp;body body) => sink-specific result</div>
<div class="def">Macro CXML:WITH-ELEMENT (qname &amp;body body) => result</div>
<div class="def">Function CXML:ATTRIBUTE (name value) => value</div>
<div class="def">Function CXML:TEXT (data) => data</div>
<div class="def">Function CXML:CDATA (data) => data</div>
Convenience syntax for event-based serialization.
</p>
<p>
Example:
</p>
<pre>(with-xml-output (make-octet-stream-sink stream :indentation 2 :canonical nil)
  (with-element "foo"
    (attribute "xyz" "abc")
    (with-element "bar"
      (attribute "blub" "bla"))
    (text "Hi there.")))</pre>
<p>
Prints this to <tt>stream</tt>, which must be an
<tt>(unsigned-byte 8)</tt> stream:
</p>
<pre>&lt;foo xyz="abc"&gt;
  &lt;bar blub="bla"&gt;&lt;/bar&gt;
  Hi there.
&lt;/foo&gt;</pre>
<p>
(Note that these functions accept both strings and rods, so we
can write <tt>"foo"</tt> instead of <tt>#"foo"</tt> above.)
</p>
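<p>
The same macros can be combined with a sink that collects its output
in memory. The sketch below assumes that <tt>with-xml-output</tt>
returns whatever the sink yields at the end of the document, which for
<tt>make-octet-vector-sink</tt> is presumably an
<tt>(unsigned-byte 8)</tt> vector. The function name
<tt>greeting-octets</tt> and the element and attribute names are
made up for illustration.
</p>
<pre>(defun greeting-octets (name)
  "Serialize a one-element document containing NAME as text."
  (cxml:with-xml-output (cxml:make-octet-vector-sink :canonical nil)
    (cxml:with-element "greeting"
      ;; Attributes are emitted before any child content.
      (cxml:attribute "lang" "en")
      (cxml:text name))))</pre>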
<p>
<div class="def">Macro XHTML-GENERATOR:WITH-XHTML (sink &amp;rest forms)</div>
<div class="def">Macro XHTML-GENERATOR:WRITE-DOCTYPE (sink)</div>
Macro <tt>with-xhtml</tt> is a modified version of
Franz' <tt>htmlgen</tt> that works as a SAX driver for XHTML.
It aims to be a drop-in replacement for the <tt>html</tt> macro.
</p>
<p>
<tt>xhtmlgen</tt> is included as <tt>contrib/xhtmlgen.lisp</tt> in
the cxml distribution. Example:
</p>
<pre>(let ((sink (cxml:make-character-stream-sink *standard-output*)))
  (sax:start-document sink)
  (xhtml-generator:write-doctype sink)
  (xhtml-generator:with-html sink
    (:html
     (:head
      (:title "Title"))
     (:body
      ((:p "style" "font-weight: bold")
       "Content")
      (:ul
       (:li "One")
       (:li "Two")
       (:li "Three")))))
  (sax:end-document sink))</pre>
<a name="misc"/>
<h3>Miscellaneous SAX handlers</h3>
<p>
<div class="def">Function CXML:MAKE-VALIDATOR (dtd root)</div>
Create a SAX handler which validates against a DTD instance.&nbsp;
The document's root element must be named <tt>root</tt>.&nbsp;
Used with <tt>dom:map-document</tt>, this validates a document
object as if by re-reading it with a validating parser, except
that declarations recorded in the document instance are completely
ignored.<br/>
Example:
</p>
<pre>(let ((d (parse-file "~/test.xml" (cxml-dom:make-dom-builder)))
      (x (parse-dtd-file "~/test.dtd")))
  (dom:map-document (cxml:make-validator x #"foo") d))</pre>
<p>
<div class="def">Class CXML:SAX-PROXY ()</div>
<div class="def">Accessor CXML:PROXY-CHAINED-HANDLER</div>
<tt>sax-proxy</tt> is a SAX handler which passes all events it
receives on to a user-defined second handler, which defaults
to <tt>nil</tt>. Use <tt>sax-proxy</tt> to modify the events a
SAX handler receives by defining your own subclass
of <tt>sax-proxy</tt>. Set the chained handler to the target
handler, and define methods on your handler class for the events
to be modified. All other events will pass through to the chained
handler unmodified.
</p>
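<p>
As a sketch of this pattern, the following proxy drops all comments
and lets every other event pass through. The class name
<tt>comment-stripper</tt> and the function
<tt>parse-without-comments</tt> are hypothetical. The chained handler
is set with <tt>setf</tt> on the accessor, because no initarg is
documented here.
</p>
<pre>(defclass comment-stripper (cxml:sax-proxy) ())

;; Override only the event to be modified: comments are swallowed
;; instead of being forwarded to the chained handler.
(defmethod sax:comment ((handler comment-stripper) data)
  (declare (ignore data))
  nil)

(defun parse-without-comments (pathname target-handler)
  "Parse PATHNAME, forwarding everything but comments to TARGET-HANDLER."
  (let ((proxy (make-instance 'comment-stripper)))
    (setf (cxml:proxy-chained-handler proxy) target-handler)
    (cxml:parse-file pathname proxy)))</pre>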
<p>
<div class="def">Function CXML:MAKE-NAMESPACE-NORMALIZER (next-handler)</div>
</p>
<p>
Return a SAX handler that performs <a
href="http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/namespaces-algorithms.html#normalizeDocumentAlgo">DOM
3-style namespace normalization</a> on attribute lists in
<tt>start-element</tt> events before passing them on to the next
handler.
</p>
<a name="rods"/>
<h3>Recoders</h3>
<p>
Recoders are a mechanism used by CXML internally on Lisp implementations
without Unicode support to recode UTF-16 vectors (rods) of
integers (runes) into UTF-8 strings.
</p>
<p>
User code does not usually need to deal with recoders in current
versions of CXML.
</p>
<p>
<div class="def">Function CXML:MAKE-RECODER (chained-handler recoder-fn)</div>
Return a SAX handler which passes all events on to
<tt>chained-handler</tt> after converting all strings and rods
using <tt>recoder-fn</tt>, a function of one argument.
</p>
<a name="dtdcache"/>
<h3>Caching of DTD Objects</h3>
<p>
To avoid spending time parsing the same DTD over and over again,
CXML can cache DTD objects. The parser consults
<tt>cxml:*dtd-cache*</tt> whenever it is looking for an external
subset in a document which does not have an internal subset and
uses the cached DTD instance if one is present in the cache for
the System ID in question.
</p>
<p>
Note that DTDs do not expire from the cache automatically.
(Future versions of CXML might introduce automatic checks for
outdated DTDs.)
</p>
<p>
<div class="def">Variable CXML:*DTD-CACHE*</div>
The DTD cache object consulted by the parser when it needs a DTD.
</p>
<p>
<div class="def">Function CXML:MAKE-DTD-CACHE ()</div>
Return a new, empty DTD cache object.
</p>
<p>
<div class="def">Variable CXML:*CACHE-ALL-DTDS*</div>
If true, instructs the parser to enter all DTDs that could have
been cached into <tt>*dtd-cache*</tt> if they were not cached
already. Defaults to <tt>nil</tt>.
</p>
<p>
<div class="def">Reader CXML:GETDTD (uri dtd-cache)</div>
Return a cached instance of the DTD at <tt>uri</tt>, if present in
the cache, or <tt>nil</tt>.
</p>
<p>
<div class="def">Writer CXML:GETDTD (uri dtd-cache)</div>
Enter a new value for <tt>uri</tt> into <tt>dtd-cache</tt>.
</p>
<p>
<div class="def">Function CXML:REMDTD (uri dtd-cache)</div>
Ensure that no DTD is recorded for <tt>uri</tt> in the cache and
return true if such a DTD was present.
</p>
<p>
<div class="def">Function CXML:CLEAR-DTD-CACHE (dtd-cache)</div>
Remove all entries from <tt>dtd-cache</tt>.
</p>
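<p>
The variables and functions above might be combined as follows to
parse a batch of documents that share a DTD. This is a sketch only:
the function name <tt>parse-all-validating</tt> and its
<tt>make-handler</tt> argument (a function of no arguments returning
a fresh SAX handler) are assumptions made for the example.
</p>
<pre>(defun parse-all-validating (pathnames make-handler)
  "Parse every file in PATHNAMES in validating mode, sharing one DTD cache."
  ;; Bind a fresh cache dynamically so that entries do not outlive
  ;; this call; DTDs never expire from a cache on their own.
  (let ((cxml:*dtd-cache* (cxml:make-dtd-cache))
        (cxml:*cache-all-dtds* t))
    (mapcar (lambda (pathname)
              (cxml:parse-file pathname (funcall make-handler)
                               :validate t))
            pathnames)))</pre>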
<p>
<em>fixme:</em> thread-safety
</p>
<a name="catalogs"/>
<h3>XML Catalogs</h3>
<p>
External entities (for example, DTDs) are referred to using their
Public and System IDs. Usually the System ID, a URI, is used to
locate the entity. CXML itself handles only file://-URIs, but
many System IDs in practical use are http://-URIs. There are two
different mechanisms applications can use to allow CXML to locate
entities using arbitrary Public IDs or System IDs:
</p>
<ul>
<li>
User-defined entity resolvers can be used to open entities using
arbitrary protocols. For example, an entity resolver could
handle all System-IDs with the <tt>http</tt> scheme using some
HTTP library. Refer to the description of the
<tt>entity-resolver</tt> keyword argument to parser functions (see <a
href="#parser"><tt>cxml:parse-file</tt></a>) for more
information on entity resolvers.
</li>
<li>
XML Catalogs are (local) tables in XML syntax which map External
IDs to alternative System IDs. If, say, the xhtml DTD is
present in the local file system and the local copy has been
registered with the XML catalog, CXML will use the local copy of
the DTD instead of trying to open the version available using HTTP.
</li>
</ul>
<p>
This section describes XML Catalogs, the second solution. CXML
implements <a
href="http://www.oasis-open.org/committees/entity/spec.html">OASIS
XML Catalogs</a>.
</p>
<p>
<div class="def">Variable CXML:*CATALOG*</div>
The XML Catalog object consulted by the parser before trying to
open an entity. Initially <tt>nil</tt>.
</p>
<p>
<div class="def">Variable CXML:*PREFER*</div>
The default "prefer" mode from the Catalog specification, one
of <tt>:public</tt> or <tt>:system</tt>. Defaults
to <tt>:public</tt>.
</p>
<p>
<div class="def">Function CXML:MAKE-CATALOG (&amp;optional uris)</div>
Return a catalog object for the catalog files specified.
</p>
<p>
<div class="def">Function CXML:RESOLVE-URI (uri catalog)</div>
Look up <tt>uri</tt> in <tt>catalog</tt> and return the
resulting URI, or <tt>nil</tt> if no match was found.
</p>
<p>
<div class="def">Function CXML:RESOLVE-EXTID (publicid systemid catalog)</div>
Look up the External ID (<tt>publicid</tt>, <tt>systemid</tt>)
in <tt>catalog</tt> and return the resulting URI, or <tt>nil</tt>
if no match was found.
</p>
<p>
Example:
</p>
<pre>* (setf cxml:*catalog* nil)
* (cxml:parse-file "test.xhtml" nil)
=> Error: URI scheme :HTTP not supported
* (setf cxml:*catalog* (cxml:make-catalog))
* (cxml:parse-file "test.xhtml" nil)
;; no error!
NIL</pre>
<p>
Note that parsed catalog files are cached in the catalog object.
Catalog files cached do not expire automatically. To ensure that
all catalog files are parsed again, create a new catalog object.
</p>
<a name="sax"/>
<h2>SAX Interface</h2>
<p>
A SAX handler is an arbitrary object that implements some of the
generic functions in the SAX package.&nbsp; Note that no default
handler class is necessary, because all generic functions have default
methods which do nothing.&nbsp; SAX functions are:
<div class="def">Function SAX:START-DOCUMENT (handler)</div>
<div class="def">Function SAX:END-DOCUMENT (handler)</div>
<br/>
<div class="def">Function SAX:START-ELEMENT (handler namespace-uri local-name qname attributes)</div>
<div class="def">Function SAX:END-ELEMENT (handler namespace-uri local-name qname)</div>
<div class="def">Function SAX:START-PREFIX-MAPPING (handler prefix uri)</div>
<div class="def">Function SAX:END-PREFIX-MAPPING (handler prefix)</div>
<div class="def">Function SAX:PROCESSING-INSTRUCTION (handler target data)</div>
<div class="def">Function SAX:COMMENT (handler data)</div>
<div class="def">Function SAX:START-CDATA (handler)</div>
<div class="def">Function SAX:END-CDATA (handler)</div>
<div class="def">Function SAX:CHARACTERS (handler data)</div>
<br/>
<div class="def">Function SAX:START-DTD (handler name public-id system-id)</div>
<div class="def">Function SAX:END-DTD (handler)</div>
<div class="def">Function SAX:START-INTERNAL-SUBSET (handler)</div>
<div class="def">Function SAX:END-INTERNAL-SUBSET (handler)</div>
<div class="def">Function SAX:UNPARSED-ENTITY-DECLARATION (handler name public-id system-id notation-name)</div>
<div class="def">Function SAX:EXTERNAL-ENTITY-DECLARATION (handler kind name public-id system-id)</div>
<div class="def">Function SAX:INTERNAL-ENTITY-DECLARATION (handler kind name value)</div>
<div class="def">Function SAX:NOTATION-DECLARATION (handler name public-id system-id)</div>
<div class="def">Function SAX:ELEMENT-DECLARATION (handler name model)</div>
<div class="def">Function SAX:ATTRIBUTE-DECLARATION (handler ename aname type default)</div>
<br/>
<div class="def">Accessor SAX:ATTRIBUTE-PREFIX (attribute)</div>
<div class="def">Accessor SAX:ATTRIBUTE-NAMESPACE-URI (attribute)</div>
<div class="def">Accessor SAX:ATTRIBUTE-LOCAL-NAME (attribute)</div>
<div class="def">Accessor SAX:ATTRIBUTE-QNAME (attribute)</div>
<div class="def">Accessor SAX:ATTRIBUTE-SPECIFIED-P (attribute)</div>
<div class="def">Accessor SAX:ATTRIBUTE-VALUE (attribute)</div>
</p>
<p>
The entity declaration methods are similar to Java SAX
definitions, but parameter entities are distinguished from
general entities not by a <tt>%</tt> prefix to the name, but by
the <tt>kind</tt> argument, either <tt>:parameter</tt> or
<tt>:general</tt>.
</p>
<p>
The arguments to <tt>sax:element-declaration</tt> and
<tt>sax:attribute-declaration</tt> differ significantly from their
Java counterparts.
</p>
<p>
<i>fixme</i>: For more information on these functions refer to the docstrings.
</p>
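<p>
Because every SAX generic function has a default method, a handler
only needs methods for the events it cares about. The following
sketch counts the elements in a document. The class name
<tt>element-counter</tt> and its accessor <tt>element-total</tt> are
hypothetical. The sketch assumes that the parser returns whatever the
handler's <tt>sax:end-document</tt> method returns.
</p>
<pre>(defclass element-counter ()
  ((total :initform 0 :accessor element-total)))

;; Called once per start tag; all arguments except the handler are ignored.
(defmethod sax:start-element ((handler element-counter)
                              namespace-uri local-name qname attributes)
  (declare (ignore namespace-uri local-name qname attributes))
  (incf (element-total handler)))

;; The value returned here is assumed to become the result of the parse.
(defmethod sax:end-document ((handler element-counter))
  (element-total handler))

(defun count-elements (pathname)
  "Return the number of elements in the document at PATHNAME."
  (cxml:parse-file pathname (make-instance 'element-counter)))</pre>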
</body>
</html>