utf8-dom fixes.

recoding nach utf-8 jetzt der default.
This commit is contained in:
dlichteblau
2005-12-27 01:35:13 +00:00
parent 42987f5dba
commit dbb2732913
12 changed files with 191 additions and 59 deletions

View File

@ -22,10 +22,33 @@
To parse an XML document into a DOM tree, use the SAX parser with a
DOM builder as the SAX handler. Example:
</p>
<pre>(cxml:parse-file "test.xml" (dom:make-dom-builder))</pre>
<pre>(cxml:parse-file "test.xml" (cxml-dom:make-dom-builder))</pre>
<p>
<div class="def">Function DOM:MAKE-DOM-BUILDER ()</div>
<div class="def">Function CXML-DOM:MAKE-DOM-BUILDER ()</div>
Create a SAX handler which builds a DOM document.
<p>
</p>
This functions returns a DOM builder that will work with the default
configuration of the SAX parser and is guaranteed to use
characters/strings instead of runes/rods, if that makes a
difference on the Lisp in question.
<p>
</p>
This is the same as <tt>rune-dom:make-dom-builder</tt> on Lisps
with Unicode support, and the same as
<tt>utf8-dom:make-dom-builder</tt> otherwise.
</p>
<p>
<div class="def">Function RUNE-DOM:MAKE-DOM-BUILDER ()</div>
Create a SAX handler which builds a DOM document using runes and rods.
</p>
<p>
<div class="def">Function UTF8-DOM:MAKE-DOM-BUILDER ()</div>
(Only on Lisps without Unicode support:)
Create a SAX handler which builds a DOM document using
UTF-8-encoded strings.
</p>
<a name="serialization"/>
@ -63,6 +86,12 @@
<tt>include-default-values</tt> -- include attribute nodes with nil
<tt>dom:specified</tt>.
</li>
<li>
<tt>recode</tt> -- (ignored on Lisps with Unicode support.) If
true, recode UTF-8 strings to rods. Defaults to true if used
with a UTF-8 DOM document. It can be set to false manually to
suppress recoding in this case.
</li>
</ul>
<p>

View File

@ -33,7 +33,7 @@
<li><a href="using.html#parser">Parsing and Validating</a></li>
<li><a href="using.html#serialization">Serialization</a></li>
<li><a href="using.html#misc">Miscellaneous SAX handlers</a></li>
<li><a href="using.html#rods">Dealing with Rods</a></li>
<li><a href="using.html#rods">Recoders</a></li>
<li><a href="using.html#dtdcache">Caching of DTD Objects</a></li>
<li><a href="using.html#catalogs">XML Catalogs</a></li>
<li><a href="using.html#sax">SAX Interface</a></li>
@ -67,7 +67,7 @@
<p>Parse <tt>example.xml</tt> into a DOM tree (<a href="using.html#parser">read
more</a>):</p>
<pre>* <b>(cxml:parse-file "example.xml" (dom:make-dom-builder))</b>
<pre>* <b>(cxml:parse-file "example.xml" (cxml-dom:make-dom-builder))</b>
#&lt;DOM-IMPL::DOCUMENT @ #x72206172>
;; save result for later:
* <b>(defparameter *example* *)</b>

View File

@ -69,6 +69,13 @@
<tt>disallow-internal-subset</tt> -- a boolean. If true, signal
an error if the document contains an internal subset.
</li>
<li>
<tt>recode</tt> -- a boolean. (Ignored on Lisps with Unicode
support.) Recode rods to UTF-8 strings. Defaults to true.
Make sure to use <tt>utf8-dom:make-dom-builder</tt> if this
option is enabled and <tt>rune-dom:make-dom-builder</tt>
otherwise.
</li>
</ul>
<p>
@ -258,7 +265,7 @@
ignored.<br/>
Example:
</p>
<pre>(let ((d (parse-file "~/test.xml" (dom:make-dom-builder)))
<pre>(let ((d (parse-file "~/test.xml" (cxml-dom:make-dom-builder)))
(x (parse-dtd-file "~/test.dtd")))
(dom:map-document (cxml:make-validator x #"foo") d))</pre>
@ -287,40 +294,15 @@
</p>
<a name="rods"/>
<h3>Dealing with Rods</h3>
<h3>Recoders</h3>
<p>
As explained above, the XML parser handles character encoding and
uses 16bit strings internally. Instead of using characters and strings
it uses <em>runes</em> and <em>rods</em>. This is seen as a
feature, but can be inconvenient.
Recoders are a mechanism used by CXML internally on Lisp implementations
without Unicode support to recode UTF-16 vectors (rods) of
integers (runes) into UTF-8 strings.
</p>
<ul>
<li>
If your Lisp supports 16 bit unicode strings, use feature
<tt>:rune-is-character</tt> and forget about runes and rods.
CXML will use ordinary Lisp characters and strings both
internally and externally.
</li>
<li>
If your Lisp does not support such strings and your application
needs Unicode support, use functions defined in the
<tt>runes</tt> package instead of ordinary string operators.
</li>
<li>
If your Lisp does not support such strings and your application
does not need Unicode support anyway, it will probably be more
convenient to let CXML convert rods into strings automatically.
To do that, use <tt>cxml:make-recoder</tt> to chain a special
sax handler between the parser and your application handler.
The recoder translates all rods using an application defined
function, which defaults to <tt>runes:rod-string</tt>. Although
the actual XML parser still uses rods internally, you SAX
handler will only see ordinary Lisp strings.
</li>
</ul>
<p>
Note that the recoder approach does <em>not</em> work with the DOM
builder, since DOM is specified to use UTF-16.
User code does not usually need to deal with recoders in current
versions of CXML.
</p>
<p>
<div class="def">Function CXML:MAKE-RECODER (chained-handler recoder-fn)</div>
@ -328,16 +310,6 @@
<tt>chained-handler</tt> after converting all strings and rods
using <tt>recoder-fn</tt>, a function of one argument.
</p>
<p>
<b>Example.</b> In a Lisp which ordinarily would use octet vector rods:
</p>
<pre>CL-USER(14): (cxml:parse-string "&lt;test/&gt;" (cxml-xmls:make-xmls-builder))
(#(116 101 115 116) NIL)</pre>
<p>
Use a SAX recoder to get strings instead::
</p>
<pre>CL-USER(17): (parse-string "&lt;test/&gt;" (cxml:make-recoder (cxml-xmls:make-xmls-builder) 'runes:rod-string))
("test" NIL)</pre>
<a name="dtdcache"/>
<h3>Caching of DTD Objects</h3>

View File

@ -83,7 +83,7 @@
(defmethod sax:start-element
((handler dom-builder) namespace-uri local-name qname attributes)
(check-type qname rod)
(check-type qname rod) ;catch recoder/builder mismatch
(with-slots (document element-stack) handler
(let* ((nsp sax:*namespace-processing*)
(element (make-instance 'element

View File

@ -1228,7 +1228,9 @@
(when resolver
(setf (document handler) owner)
(push instance (element-stack handler))
(funcall resolver (dom:name instance) handler)))
#+cxml-system::utf8dom-file
(setf handler (cxml:make-recoder handler #'cxml:rod-to-utf8-string))
(funcall resolver (real-rod (dom:name instance)) handler)))
(labels ((walk (n)
(setf (slot-value n 'read-only-p) t)
(when (dom:element-p n)

View File

@ -12,7 +12,11 @@
(handler document
&key (include-xmlns-attributes sax:*include-xmlns-attributes*)
include-doctype
include-default-values)
include-default-values
(recode (typep document 'utf8-dom::node)))
#+rune-is-integer
(when recode
(setf handler (make-recoder handler #'utf8-string-to-rod)))
(sax:start-document handler)
(when include-doctype
(let ((doctype (dom:doctype document)))

View File

@ -175,7 +175,8 @@
(defun read-members (&optional (directory *directory*))
(let* ((pathname (merge-pathnames "build/dom2-interfaces.xml" directory))
(builder (rune-dom:make-dom-builder))
(library (dom:document-element (cxml:parse-file pathname builder)))
(library (dom:document-element
(cxml:parse-file pathname builder :recode nil)))
(methods '())
(fields '()))
(do-child-elements (interface library :name "interface")
@ -584,7 +585,8 @@
(catch 'give-up
(let* ((builder (rune-dom:make-dom-builder))
(cxml::*validate* nil) ;dom1.dtd is buggy
(test (dom:document-element (cxml:parse-file pathname builder)))
(test (dom:document-element
(cxml:parse-file pathname builder :recode nil)))
title
(bindings '())
(code '()))
@ -631,7 +633,8 @@
(setf name (runes:rod-string name))
(cxml:parse-file
(make-pathname :name name :type "xml" :defaults *files-directory*)
(rune-dom:make-dom-builder)))
(rune-dom:make-dom-builder)
:recode nil))
(defparameter *bad-tests*
'("hc_elementnormalize2.xml"
@ -656,7 +659,7 @@
(let* ((all-tests (merge-pathnames "alltests.xml" test-directory))
(builder (rune-dom:make-dom-builder))
(suite (dom:document-element
(cxml:parse-file all-tests builder)))
(cxml:parse-file all-tests builder :recode nil)))
(*files-directory*
(merge-pathnames "files/" test-directory)))
(do-child-elements (member suite)

102
test/utf8domtest.diff Normal file
View File

@ -0,0 +1,102 @@
Index: test/domtest.lisp
===================================================================
RCS file: /project/cxml/cvsroot/cxml/test/domtest.lisp,v
retrieving revision 1.13
diff -u -r1.13 domtest.lisp
--- test/domtest.lisp 27 Dec 2005 00:21:37 -0000 1.13
+++ test/domtest.lisp 27 Dec 2005 00:46:00 -0000
@@ -137,21 +137,22 @@
((digit-char-p (runes:rune-char (elt str 0)))
(parse-integer (runes:rod-string str)))
((runes:rune= (elt str 0) #.(runes:char-rune #\"))
- (let ((v (make-array 1 :fill-pointer 0 :adjustable t)))
- (for* ((i = 1 :then (1+ i))
- (c = (elt str i))
- :until (runes:rune= c #.(runes:char-rune #\")))
- (if (runes:rune= c #.(runes:char-rune #\\))
- (let ((frob
- (progn
- (incf i)
- (elt str i))))
- (ecase frob
- ;; ...
- (#/n (vector-push-extend #/newline v (length v)))
- ((#/\\ #/\") (vector-push-extend #/\\ v (length v)))))
- (vector-push-extend c v (length v))))
- (coerce v 'runes::simple-rod)))
+ (utf8-dom::%rod
+ (let ((v (make-array 1 :fill-pointer 0 :adjustable t)))
+ (for* ((i = 1 :then (1+ i))
+ (c = (elt str i))
+ :until (runes:rune= c #.(runes:char-rune #\")))
+ (if (runes:rune= c #.(runes:char-rune #\\))
+ (let ((frob
+ (progn
+ (incf i)
+ (elt str i))))
+ (ecase frob
+ ;; ...
+ (#/n (vector-push-extend #/newline v (length v)))
+ ((#/\\ #/\") (vector-push-extend #/\\ v (length v)))))
+ (vector-push-extend c v (length v))))
+ (coerce v 'runes::simple-rod))))
(t
(%intern str))))
@@ -368,7 +369,7 @@
(defun translate-implementation (elt)
(with-attributes (|var|) elt
- (maybe-setf (%intern |var|) `'rune-dom:implementation)))
+ (maybe-setf (%intern |var|) `'utf8-dom:implementation)))
(defun translate-length (load)
;; XXX Soweit ich sehe unterscheiden die Tests nicht zwischen
@@ -406,7 +407,7 @@
(if (nullify |obj|)
(translate-member element)
(maybe-setf (%intern |var|)
- `(dom:has-feature 'rune-dom:implementation
+ `(dom:has-feature 'utf8-dom:implementation
,(parse-java-literal |feature|)
,(parse-java-literal |version|))))))
@@ -493,9 +494,9 @@
(return
`(block assert-domexception
(handler-bind
- ((rune-dom::dom-exception
+ ((utf8-dom::dom-exception
(lambda (c)
- (when (eq (rune-dom::dom-exception-key c)
+ (when (eq (utf8-dom::dom-exception-key c)
,(intern (tag-name c) :keyword))
(return-from assert-domexception)))))
,@(translate-body c)
@@ -506,7 +507,7 @@
,@(map-child-elements
'list
(lambda (exception)
- `(when (eq (rune-dom::dom-exception-key c)
+ `(when (eq (utf8-dom::dom-exception-key c)
,(intern (runes:rod-string (dom:get-attribute exception "code"))
:keyword))
,@(translate-body exception)
@@ -516,7 +517,7 @@
(defun translate-try (element)
`(block try
(handler-bind
- ((rune-dom::dom-exception
+ ((utf8-dom::dom-exception
,(translate-catch
(do-child-elements (c element :name "catch") (return c))
'(return-from try))))
@@ -631,7 +632,7 @@
(setf name (runes:rod-string name))
(cxml:parse-file
(make-pathname :name name :type "xml" :defaults *files-directory*)
- (rune-dom:make-dom-builder)))
+ (cxml:make-recoder (utf8-dom:make-dom-builder) 'cxml:rod-to-utf8-string)))
(defparameter *bad-tests*
'("hc_elementnormalize2.xml"

View File

@ -77,7 +77,7 @@
(defun run-all-tests (directory)
(let* ((pathname (merge-pathnames "xmlconf.xml" directory))
(builder (rune-dom:make-dom-builder))
(xmlconf (cxml:parse-file pathname builder))
(xmlconf (cxml:parse-file pathname builder :recode nil))
(ntried 0)
(nfailed 0)
(nskipped 0)
@ -125,6 +125,7 @@
(let ((document (apply #'cxml:parse-file
pathname
(rune-dom:make-dom-builder)
:recode nil
args)))
(cond
((null output)
@ -161,7 +162,10 @@
(handler-case
(progn
(format t " [validating:]")
(cxml:parse-file pathname (rune-dom:make-dom-builder) :validate t)
(cxml:parse-file pathname
(rune-dom:make-dom-builder)
:recode nil
:validate t)
(error "validity error not detected")
nil)
(cxml:validity-error ()
@ -174,7 +178,10 @@
(handler-case
(progn
(format t " [not validating:]")
(cxml:parse-file pathname (rune-dom:make-dom-builder) :validate nil)
(cxml:parse-file pathname
(rune-dom:make-dom-builder)
:recode nil
:validate nil)
(error "well-formedness violation not detected")
nil)
(cxml:well-formedness-violation ()
@ -183,7 +190,10 @@
(handler-case
(progn
(format t " [validating:]")
(cxml:parse-file pathname (rune-dom:make-dom-builder) :validate t)
(cxml:parse-file pathname
(rune-dom:make-dom-builder)
:recode nil
:validate t)
(error "well-formedness violation not detected")
nil)
(cxml:well-formedness-violation ()

View File

@ -75,4 +75,6 @@
#:make-recoder
#:sax-proxy
#:proxy-chained-handler
#:make-namespace-normalizer))
#:make-namespace-normalizer
#:rod-to-utf8-string
#:utf8-string-to-rod))

View File

@ -74,6 +74,9 @@
(%string public-id)
(%string system-id))
(defwrapper sax:start-internal-subset ())
(defwrapper sax:end-internal-subset ())
(defwrapper sax:end-dtd ())
(defwrapper sax:unparsed-entity-declaration

View File

@ -2552,13 +2552,18 @@
(defun p/document
(input handler
&key validate dtd root entity-resolver disallow-internal-subset)
&key validate dtd root entity-resolver disallow-internal-subset
(recode t))
;; check types of user-supplied arguments for better error messages:
(check-type validate boolean)
(check-type recode boolean)
(check-type dtd (or null extid))
(check-type root (or null rod))
(check-type entity-resolver (or null function symbol))
(check-type disallow-internal-subset boolean)
#+rune-is-integer
(when recode
(setf handler (make-recoder handler #'rod-to-utf8-string)))
(let ((*ctx*
(make-context :handler handler
:main-zstream input