utf8-dom fixes.

recoding nach utf-8 jetzt der default.
This commit is contained in:
dlichteblau
2005-12-27 01:35:13 +00:00
parent 42987f5dba
commit dbb2732913
12 changed files with 191 additions and 59 deletions

View File

@ -22,10 +22,33 @@
To parse an XML document into a DOM tree, use the SAX parser with a To parse an XML document into a DOM tree, use the SAX parser with a
DOM builder as the SAX handler. Example: DOM builder as the SAX handler. Example:
</p> </p>
<pre>(cxml:parse-file "test.xml" (dom:make-dom-builder))</pre> <pre>(cxml:parse-file "test.xml" (cxml-dom:make-dom-builder))</pre>
<p> <p>
<div class="def">Function DOM:MAKE-DOM-BUILDER ()</div> <div class="def">Function CXML-DOM:MAKE-DOM-BUILDER ()</div>
Create a SAX handler which builds a DOM document. Create a SAX handler which builds a DOM document.
<p>
</p>
This functions returns a DOM builder that will work with the default
configuration of the SAX parser and is guaranteed to use
characters/strings instead of runes/rods, if that makes a
difference on the Lisp in question.
<p>
</p>
This is the same as <tt>rune-dom:make-dom-builder</tt> on Lisps
with Unicode support, and the same as
<tt>utf8-dom:make-dom-builder</tt> otherwise.
</p>
<p>
<div class="def">Function RUNE-DOM:MAKE-DOM-BUILDER ()</div>
Create a SAX handler which builds a DOM document using runes and rods.
</p>
<p>
<div class="def">Function UTF8-DOM:MAKE-DOM-BUILDER ()</div>
(Only on Lisps without Unicode support:)
Create a SAX handler which builds a DOM document using
UTF-8-encoded strings.
</p> </p>
<a name="serialization"/> <a name="serialization"/>
@ -63,6 +86,12 @@
<tt>include-default-values</tt> -- include attribute nodes with nil <tt>include-default-values</tt> -- include attribute nodes with nil
<tt>dom:specified</tt>. <tt>dom:specified</tt>.
</li> </li>
<li>
<tt>recode</tt> -- (ignored on Lisps with Unicode support.) If
true, recode UTF-8 strings to rods. Defaults to true if used
with a UTF-8 DOM document. It can be set to false manually to
suppress recoding in this case.
</li>
</ul> </ul>
<p> <p>

View File

@ -33,7 +33,7 @@
<li><a href="using.html#parser">Parsing and Validating</a></li> <li><a href="using.html#parser">Parsing and Validating</a></li>
<li><a href="using.html#serialization">Serialization</a></li> <li><a href="using.html#serialization">Serialization</a></li>
<li><a href="using.html#misc">Miscellaneous SAX handlers</a></li> <li><a href="using.html#misc">Miscellaneous SAX handlers</a></li>
<li><a href="using.html#rods">Dealing with Rods</a></li> <li><a href="using.html#rods">Recoders</a></li>
<li><a href="using.html#dtdcache">Caching of DTD Objects</a></li> <li><a href="using.html#dtdcache">Caching of DTD Objects</a></li>
<li><a href="using.html#catalogs">XML Catalogs</a></li> <li><a href="using.html#catalogs">XML Catalogs</a></li>
<li><a href="using.html#sax">SAX Interface</a></li> <li><a href="using.html#sax">SAX Interface</a></li>
@ -67,7 +67,7 @@
<p>Parse <tt>example.xml</tt> into a DOM tree (<a href="using.html#parser">read <p>Parse <tt>example.xml</tt> into a DOM tree (<a href="using.html#parser">read
more</a>):</p> more</a>):</p>
<pre>* <b>(cxml:parse-file "example.xml" (dom:make-dom-builder))</b> <pre>* <b>(cxml:parse-file "example.xml" (cxml-dom:make-dom-builder))</b>
#&lt;DOM-IMPL::DOCUMENT @ #x72206172> #&lt;DOM-IMPL::DOCUMENT @ #x72206172>
;; save result for later: ;; save result for later:
* <b>(defparameter *example* *)</b> * <b>(defparameter *example* *)</b>

View File

@ -69,6 +69,13 @@
<tt>disallow-internal-subset</tt> -- a boolean. If true, signal <tt>disallow-internal-subset</tt> -- a boolean. If true, signal
an error if the document contains an internal subset. an error if the document contains an internal subset.
</li> </li>
<li>
<tt>recode</tt> -- a boolean. (Ignored on Lisps with Unicode
support.) Recode rods to UTF-8 strings. Defaults to true.
Make sure to use <tt>utf8-dom:make-dom-builder</tt> if this
option is enabled and <tt>rune-dom:make-dom-builder</tt>
otherwise.
</li>
</ul> </ul>
<p> <p>
@ -258,7 +265,7 @@
ignored.<br/> ignored.<br/>
Example: Example:
</p> </p>
<pre>(let ((d (parse-file "~/test.xml" (dom:make-dom-builder))) <pre>(let ((d (parse-file "~/test.xml" (cxml-dom:make-dom-builder)))
(x (parse-dtd-file "~/test.dtd"))) (x (parse-dtd-file "~/test.dtd")))
(dom:map-document (cxml:make-validator x #"foo") d))</pre> (dom:map-document (cxml:make-validator x #"foo") d))</pre>
@ -287,40 +294,15 @@
</p> </p>
<a name="rods"/> <a name="rods"/>
<h3>Dealing with Rods</h3> <h3>Recoders</h3>
<p> <p>
As explained above, the XML parser handles character encoding and Recoders are a mechanism used by CXML internally on Lisp implementations
uses 16bit strings internally. Instead of using characters and strings without Unicode support to recode UTF-16 vectors (rods) of
it uses <em>runes</em> and <em>rods</em>. This is seen as a integers (runes) into UTF-8 strings.
feature, but can be inconvenient.
</p> </p>
<ul>
<li>
If your Lisp supports 16 bit unicode strings, use feature
<tt>:rune-is-character</tt> and forget about runes and rods.
CXML will use ordinary Lisp characters and strings both
internally and externally.
</li>
<li>
If your Lisp does not support such strings and your application
needs Unicode support, use functions defined in the
<tt>runes</tt> package instead of ordinary string operators.
</li>
<li>
If your Lisp does not support such strings and your application
does not need Unicode support anyway, it will probably be more
convenient to let CXML convert rods into strings automatically.
To do that, use <tt>cxml:make-recoder</tt> to chain a special
sax handler between the parser and your application handler.
The recoder translates all rods using an application defined
function, which defaults to <tt>runes:rod-string</tt>. Although
the actual XML parser still uses rods internally, you SAX
handler will only see ordinary Lisp strings.
</li>
</ul>
<p> <p>
Note that the recoder approach does <em>not</em> work with the DOM User code does not usually need to deal with recoders in current
builder, since DOM is specified to use UTF-16. versions of CXML.
</p> </p>
<p> <p>
<div class="def">Function CXML:MAKE-RECODER (chained-handler recoder-fn)</div> <div class="def">Function CXML:MAKE-RECODER (chained-handler recoder-fn)</div>
@ -328,16 +310,6 @@
<tt>chained-handler</tt> after converting all strings and rods <tt>chained-handler</tt> after converting all strings and rods
using <tt>recoder-fn</tt>, a function of one argument. using <tt>recoder-fn</tt>, a function of one argument.
</p> </p>
<p>
<b>Example.</b> In a Lisp which ordinarily would use octet vector rods:
</p>
<pre>CL-USER(14): (cxml:parse-string "&lt;test/&gt;" (cxml-xmls:make-xmls-builder))
(#(116 101 115 116) NIL)</pre>
<p>
Use a SAX recoder to get strings instead::
</p>
<pre>CL-USER(17): (parse-string "&lt;test/&gt;" (cxml:make-recoder (cxml-xmls:make-xmls-builder) 'runes:rod-string))
("test" NIL)</pre>
<a name="dtdcache"/> <a name="dtdcache"/>
<h3>Caching of DTD Objects</h3> <h3>Caching of DTD Objects</h3>

View File

@ -83,7 +83,7 @@
(defmethod sax:start-element (defmethod sax:start-element
((handler dom-builder) namespace-uri local-name qname attributes) ((handler dom-builder) namespace-uri local-name qname attributes)
(check-type qname rod) (check-type qname rod) ;catch recoder/builder mismatch
(with-slots (document element-stack) handler (with-slots (document element-stack) handler
(let* ((nsp sax:*namespace-processing*) (let* ((nsp sax:*namespace-processing*)
(element (make-instance 'element (element (make-instance 'element

View File

@ -1228,7 +1228,9 @@
(when resolver (when resolver
(setf (document handler) owner) (setf (document handler) owner)
(push instance (element-stack handler)) (push instance (element-stack handler))
(funcall resolver (dom:name instance) handler))) #+cxml-system::utf8dom-file
(setf handler (cxml:make-recoder handler #'cxml:rod-to-utf8-string))
(funcall resolver (real-rod (dom:name instance)) handler)))
(labels ((walk (n) (labels ((walk (n)
(setf (slot-value n 'read-only-p) t) (setf (slot-value n 'read-only-p) t)
(when (dom:element-p n) (when (dom:element-p n)

View File

@ -12,7 +12,11 @@
(handler document (handler document
&key (include-xmlns-attributes sax:*include-xmlns-attributes*) &key (include-xmlns-attributes sax:*include-xmlns-attributes*)
include-doctype include-doctype
include-default-values) include-default-values
(recode (typep document 'utf8-dom::node)))
#+rune-is-integer
(when recode
(setf handler (make-recoder handler #'utf8-string-to-rod)))
(sax:start-document handler) (sax:start-document handler)
(when include-doctype (when include-doctype
(let ((doctype (dom:doctype document))) (let ((doctype (dom:doctype document)))

View File

@ -175,7 +175,8 @@
(defun read-members (&optional (directory *directory*)) (defun read-members (&optional (directory *directory*))
(let* ((pathname (merge-pathnames "build/dom2-interfaces.xml" directory)) (let* ((pathname (merge-pathnames "build/dom2-interfaces.xml" directory))
(builder (rune-dom:make-dom-builder)) (builder (rune-dom:make-dom-builder))
(library (dom:document-element (cxml:parse-file pathname builder))) (library (dom:document-element
(cxml:parse-file pathname builder :recode nil)))
(methods '()) (methods '())
(fields '())) (fields '()))
(do-child-elements (interface library :name "interface") (do-child-elements (interface library :name "interface")
@ -584,7 +585,8 @@
(catch 'give-up (catch 'give-up
(let* ((builder (rune-dom:make-dom-builder)) (let* ((builder (rune-dom:make-dom-builder))
(cxml::*validate* nil) ;dom1.dtd is buggy (cxml::*validate* nil) ;dom1.dtd is buggy
(test (dom:document-element (cxml:parse-file pathname builder))) (test (dom:document-element
(cxml:parse-file pathname builder :recode nil)))
title title
(bindings '()) (bindings '())
(code '())) (code '()))
@ -631,7 +633,8 @@
(setf name (runes:rod-string name)) (setf name (runes:rod-string name))
(cxml:parse-file (cxml:parse-file
(make-pathname :name name :type "xml" :defaults *files-directory*) (make-pathname :name name :type "xml" :defaults *files-directory*)
(rune-dom:make-dom-builder))) (rune-dom:make-dom-builder)
:recode nil))
(defparameter *bad-tests* (defparameter *bad-tests*
'("hc_elementnormalize2.xml" '("hc_elementnormalize2.xml"
@ -656,7 +659,7 @@
(let* ((all-tests (merge-pathnames "alltests.xml" test-directory)) (let* ((all-tests (merge-pathnames "alltests.xml" test-directory))
(builder (rune-dom:make-dom-builder)) (builder (rune-dom:make-dom-builder))
(suite (dom:document-element (suite (dom:document-element
(cxml:parse-file all-tests builder))) (cxml:parse-file all-tests builder :recode nil)))
(*files-directory* (*files-directory*
(merge-pathnames "files/" test-directory))) (merge-pathnames "files/" test-directory)))
(do-child-elements (member suite) (do-child-elements (member suite)

102
test/utf8domtest.diff Normal file
View File

@ -0,0 +1,102 @@
Index: test/domtest.lisp
===================================================================
RCS file: /project/cxml/cvsroot/cxml/test/domtest.lisp,v
retrieving revision 1.13
diff -u -r1.13 domtest.lisp
--- test/domtest.lisp 27 Dec 2005 00:21:37 -0000 1.13
+++ test/domtest.lisp 27 Dec 2005 00:46:00 -0000
@@ -137,21 +137,22 @@
((digit-char-p (runes:rune-char (elt str 0)))
(parse-integer (runes:rod-string str)))
((runes:rune= (elt str 0) #.(runes:char-rune #\"))
- (let ((v (make-array 1 :fill-pointer 0 :adjustable t)))
- (for* ((i = 1 :then (1+ i))
- (c = (elt str i))
- :until (runes:rune= c #.(runes:char-rune #\")))
- (if (runes:rune= c #.(runes:char-rune #\\))
- (let ((frob
- (progn
- (incf i)
- (elt str i))))
- (ecase frob
- ;; ...
- (#/n (vector-push-extend #/newline v (length v)))
- ((#/\\ #/\") (vector-push-extend #/\\ v (length v)))))
- (vector-push-extend c v (length v))))
- (coerce v 'runes::simple-rod)))
+ (utf8-dom::%rod
+ (let ((v (make-array 1 :fill-pointer 0 :adjustable t)))
+ (for* ((i = 1 :then (1+ i))
+ (c = (elt str i))
+ :until (runes:rune= c #.(runes:char-rune #\")))
+ (if (runes:rune= c #.(runes:char-rune #\\))
+ (let ((frob
+ (progn
+ (incf i)
+ (elt str i))))
+ (ecase frob
+ ;; ...
+ (#/n (vector-push-extend #/newline v (length v)))
+ ((#/\\ #/\") (vector-push-extend #/\\ v (length v)))))
+ (vector-push-extend c v (length v))))
+ (coerce v 'runes::simple-rod))))
(t
(%intern str))))
@@ -368,7 +369,7 @@
(defun translate-implementation (elt)
(with-attributes (|var|) elt
- (maybe-setf (%intern |var|) `'rune-dom:implementation)))
+ (maybe-setf (%intern |var|) `'utf8-dom:implementation)))
(defun translate-length (load)
;; XXX Soweit ich sehe unterscheiden die Tests nicht zwischen
@@ -406,7 +407,7 @@
(if (nullify |obj|)
(translate-member element)
(maybe-setf (%intern |var|)
- `(dom:has-feature 'rune-dom:implementation
+ `(dom:has-feature 'utf8-dom:implementation
,(parse-java-literal |feature|)
,(parse-java-literal |version|))))))
@@ -493,9 +494,9 @@
(return
`(block assert-domexception
(handler-bind
- ((rune-dom::dom-exception
+ ((utf8-dom::dom-exception
(lambda (c)
- (when (eq (rune-dom::dom-exception-key c)
+ (when (eq (utf8-dom::dom-exception-key c)
,(intern (tag-name c) :keyword))
(return-from assert-domexception)))))
,@(translate-body c)
@@ -506,7 +507,7 @@
,@(map-child-elements
'list
(lambda (exception)
- `(when (eq (rune-dom::dom-exception-key c)
+ `(when (eq (utf8-dom::dom-exception-key c)
,(intern (runes:rod-string (dom:get-attribute exception "code"))
:keyword))
,@(translate-body exception)
@@ -516,7 +517,7 @@
(defun translate-try (element)
`(block try
(handler-bind
- ((rune-dom::dom-exception
+ ((utf8-dom::dom-exception
,(translate-catch
(do-child-elements (c element :name "catch") (return c))
'(return-from try))))
@@ -631,7 +632,7 @@
(setf name (runes:rod-string name))
(cxml:parse-file
(make-pathname :name name :type "xml" :defaults *files-directory*)
- (rune-dom:make-dom-builder)))
+ (cxml:make-recoder (utf8-dom:make-dom-builder) 'cxml:rod-to-utf8-string)))
(defparameter *bad-tests*
'("hc_elementnormalize2.xml"

View File

@ -77,7 +77,7 @@
(defun run-all-tests (directory) (defun run-all-tests (directory)
(let* ((pathname (merge-pathnames "xmlconf.xml" directory)) (let* ((pathname (merge-pathnames "xmlconf.xml" directory))
(builder (rune-dom:make-dom-builder)) (builder (rune-dom:make-dom-builder))
(xmlconf (cxml:parse-file pathname builder)) (xmlconf (cxml:parse-file pathname builder :recode nil))
(ntried 0) (ntried 0)
(nfailed 0) (nfailed 0)
(nskipped 0) (nskipped 0)
@ -125,6 +125,7 @@
(let ((document (apply #'cxml:parse-file (let ((document (apply #'cxml:parse-file
pathname pathname
(rune-dom:make-dom-builder) (rune-dom:make-dom-builder)
:recode nil
args))) args)))
(cond (cond
((null output) ((null output)
@ -161,7 +162,10 @@
(handler-case (handler-case
(progn (progn
(format t " [validating:]") (format t " [validating:]")
(cxml:parse-file pathname (rune-dom:make-dom-builder) :validate t) (cxml:parse-file pathname
(rune-dom:make-dom-builder)
:recode nil
:validate t)
(error "validity error not detected") (error "validity error not detected")
nil) nil)
(cxml:validity-error () (cxml:validity-error ()
@ -174,7 +178,10 @@
(handler-case (handler-case
(progn (progn
(format t " [not validating:]") (format t " [not validating:]")
(cxml:parse-file pathname (rune-dom:make-dom-builder) :validate nil) (cxml:parse-file pathname
(rune-dom:make-dom-builder)
:recode nil
:validate nil)
(error "well-formedness violation not detected") (error "well-formedness violation not detected")
nil) nil)
(cxml:well-formedness-violation () (cxml:well-formedness-violation ()
@ -183,7 +190,10 @@
(handler-case (handler-case
(progn (progn
(format t " [validating:]") (format t " [validating:]")
(cxml:parse-file pathname (rune-dom:make-dom-builder) :validate t) (cxml:parse-file pathname
(rune-dom:make-dom-builder)
:recode nil
:validate t)
(error "well-formedness violation not detected") (error "well-formedness violation not detected")
nil) nil)
(cxml:well-formedness-violation () (cxml:well-formedness-violation ()

View File

@ -75,4 +75,6 @@
#:make-recoder #:make-recoder
#:sax-proxy #:sax-proxy
#:proxy-chained-handler #:proxy-chained-handler
#:make-namespace-normalizer)) #:make-namespace-normalizer
#:rod-to-utf8-string
#:utf8-string-to-rod))

View File

@ -74,6 +74,9 @@
(%string public-id) (%string public-id)
(%string system-id)) (%string system-id))
(defwrapper sax:start-internal-subset ())
(defwrapper sax:end-internal-subset ())
(defwrapper sax:end-dtd ()) (defwrapper sax:end-dtd ())
(defwrapper sax:unparsed-entity-declaration (defwrapper sax:unparsed-entity-declaration

View File

@ -2552,13 +2552,18 @@
(defun p/document (defun p/document
(input handler (input handler
&key validate dtd root entity-resolver disallow-internal-subset) &key validate dtd root entity-resolver disallow-internal-subset
(recode t))
;; check types of user-supplied arguments for better error messages: ;; check types of user-supplied arguments for better error messages:
(check-type validate boolean) (check-type validate boolean)
(check-type recode boolean)
(check-type dtd (or null extid)) (check-type dtd (or null extid))
(check-type root (or null rod)) (check-type root (or null rod))
(check-type entity-resolver (or null function symbol)) (check-type entity-resolver (or null function symbol))
(check-type disallow-internal-subset boolean) (check-type disallow-internal-subset boolean)
#+rune-is-integer
(when recode
(setf handler (make-recoder handler #'rod-to-utf8-string)))
(let ((*ctx* (let ((*ctx*
(make-context :handler handler (make-context :handler handler
:main-zstream input :main-zstream input