Files
CXML/xml/xml-parse.lisp
2005-03-13 18:02:10 +00:00

2825 lines
100 KiB
Common Lisp

;;; -*- Mode: Lisp; Syntax: Common-Lisp; Package: XML; readtable: glisp; Encoding: utf-8; -*-
;;; ---------------------------------------------------------------------------
;;; Title: A prototype XML parser
;;; Created: 1999-07-17
;;; Author: Gilbert Baumann <unk6@rz.uni-karlsruhe.de>
;;; License: LGPL (See file COPYING for details).
;;; ---------------------------------------------------------------------------
;;; © copyright 1999 by Gilbert Baumann
;;; This library is free software; you can redistribute it and/or
;;; modify it under the terms of the GNU Library General Public
;;; License as published by the Free Software Foundation; either
;;; version 2 of the License, or (at your option) any later version.
;;;
;;; This library is distributed in the hope that it will be useful,
;;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
;;; Library General Public License for more details.
;;;
;;; You should have received a copy of the GNU Library General Public
;;; License along with this library; if not, write to the
;;; Free Software Foundation, Inc., 59 Temple Place - Suite 330,
;;; Boston, MA 02111-1307 USA.
;;; Streams
;;; xstreams
;; For reading runes, I defined my own streams, called xstreams,
;; because we want to be fast. A function call or even a method call
;; per character is not acceptable, instead of that we define a
;; buffered stream with and advertised buffer layout, so that we
;; could use the trick stdio uses: READ-RUNE and PEEK-RUNE are macros,
;; directly accessing the buffer and only calling some underflow
;; handler in case of stream underflows. This will yield to quite a
;; performance boost vs calling READ-BYTE per character.
;; Also we need to do encoding and character set conversion on input,
;; this better done at large chunks of data rather than on a character
;; by character basis. This way we need a dispatch on the active
;; encoding only once in a while, instead of for each character. This
;; allows us to use a CLOS interface to do the underflow handling.
;;; zstreams
;; Now, for reading tokens, we define another kind of streams, called
;; zstreams. These zstreams also maintain an input stack to implement
;; inclusion of external entities. This input stack contains xstreams
;; or the special marker :STOP. Such a :STOP marker indicates, that
;; input should not continue there, but well stop; that is simulate an
;; EOF. The user is then responsible to pop this marker off the input
;; stack.
;;
;; This input stack is also used to detect circular entity inclusion.
;; The zstream tokenizer recognizes the following types of tokens and
;; is controlled by the *DATA-BEHAVIOUR* flag. (Which should become a
;; slot of zstreams instead).
;; Common
;; :xml-pi (<target> . <content>) ;processing-instruction starting with "<?xml"
;; :pi (<target> . <content>) ;processing-instruction
;; :stag (<name> . <atts>) ;start tag
;; :etag (<name> . <atts>) ;end tag
;; :ztag (<name> . <atts>) ;empty tag
;; :<!element
;; :<!entity
;; :<!attlist
;; :<!notation
;; :<!doctype
;; :<![
;; :comment <content>
;; *data-behaviour* = :DTD
;;
;; :name <interned-rod>
;; :#required
;; :#implied
;; :#fixed
;; :#pcdata
;; :s
;; :\[ :\] :\( :\) :|\ :\> :\" :\' :\, :\? :\* :\+
;; *data-behaviour* = :DOC
;;
;; :entity-ref <interned-rod>
;; :cdata <rod>
;;; NOTES
;;
;; Stream buffers as well as RODs are supposed to be encoded in
;; UTF-16.
;; where does the time go?
;; DATA-RUNE-P
;; CANON-NOT-CDATA-ATTVAL
;; READ-ATTVAL (MUFFLE)
;; CLOSy DOM
;; UTF-8 decoder (13%)
;; READ-ATTVAL (10%)
;;
;;; TODO
;;
;; o Improve error messages:
;; - line and column number counters
;; - better texts
;; - better handling of errors (no crash'n burn behaviour)
;;
;; o provide for a faster DOM
;;
;; o parse document should get passed a document instance, so that a user
;; could pass his own DOM implementation
;;
;; o morph zstream into a context object and thus also get rid of
;; special variables. Put the current DTD there too.
;; o the *scratch-pad* hack should become something much more
;; reentrant, we could either define a system-wide resource
;; or allocate some scratch-pads per context.
;; o only parse the DTD on an option
;; o make the invalid tests pass.
;;
;; o CR handling in utf-16 deocders
;;
;; o UCS-4 reader
;;
;; o max depth together with circle detection
;; (or proof, that our circle detection is enough).
;;
;; o element definitions (with att definitions in the elements)
;;
;; o store entities in the DTD
;;
;; o better extensibility wrt character representation, one may want to
;; have
;; - UTF-8 in standard CL strings
;; - UCS-2 in RODs
;; - UTF-16 in RODs
;; - UCS-4 in vectoren
;;
;; o xstreams auslagern, documententieren und dann auch in SGML und
;; CSS parser verwenden. (halt alles was zeichen liest).
;;
;; o merge node representation with SGML module
;;
;; o namespaces (this will get ugly).
;;
;; o validation
;;
;; o line/column number recording
;;
;; o better error messages
;;
;; o recording of source locations for nodes.
;;
;; o make the *scratch-pad* hack safe
;;
;; o based on the DTD and xml:space attribute implement HTML white
;; space rules.
;;
;; o on a parser option, do not expand external entities.
;;
;; o on a parser option, do not parse the DTD.
;;
;; o caching DTDs?
;;
;; That is, if we parse a lot of documents all having the same DTD,
;; we do not need to re-read it every time.
;; But watch the file write date, since not doing so would be
;; a good way to confuse a hell lot of users.
;; But: What to do with declarations in the <!DOCTYPE header?
;;
;;
;; o does the user need the distinction between "&#20;" and " " ?
;; That is literal and 'quoted' white space.
;;
;; o on an option merge CDATA section;
;;
;; o data in parse tree? extra nodes like in SGML?!
;;
;; o what to store in the node-gi field? Some name object or the
;; string used?
;;
;; Test that fail:
;;
;; not-wf/sa/128 is false a alarm
;;
(in-package :xml)
#+ALLEGRO
(setf (excl:named-readtable :glisp) *readtable*)
(eval-when (eval compile load)
(defparameter *fast* '(optimize (speed 3) (safety 0)))
;;(defparameter *fast* '(optimize (speed 2) (safety 3)))
)
(defvar *expand-pe-p*)
;;;; ---------------------------------------------------------------------------
;;;; xstreams
;;;;
(defstruct (stream-name (:type list))
entity-name
entity-kind
file-name)
(defun print-xstream (self sink depth)
(declare (ignore depth))
(format sink "#<~S ~S>" (type-of self) (mu (xstream-name self))))
(deftype read-element () 'rune)
;; (unsigned-byte 16)) ;;t)
(defmethod figure-encoding ((stream null))
(values :utf-8 nil))
(defmethod figure-encoding ((stream stream))
(let ((c0 (read-byte stream nil :eof)))
(cond ((eq c0 :eof)
(values :utf-8 nil))
(t
(let ((c1 (read-byte stream nil :eof)))
(cond ((eq c1 :eof)
(values :utf-8 (list c0)))
(t
(cond ((and (= c0 #xFE) (= c1 #xFF)) (values :utf-16-big-endian nil))
((and (= c0 #xFF) (= c1 #xFE)) (values :utf-16-little-endian nil))
(t
(values :utf-8 (list c0 c1)))))))))))
(defun call-with-open-xstream (continuation &rest open-args)
(let ((input (apply #'open (car open-args) :element-type '(unsigned-byte 8) (cdr open-args))))
(unwind-protect
(progn
(funcall continuation (make-xstream input)))
(close input))))
(defmacro with-open-xstream ((stream &rest open-args) &body body)
`(call-with-open-xstream (lambda (,stream) .,body) .,open-args))
;;; Decoders
;; The decoders share a common signature:
;;
;; DECODE input input-start input-end
;; output output-start output-end
;; eof-p
;; -> first-not-written ; first-not-read
;;
;; These decode functions should decode as much characters off `input'
;; into the `output' as possible and return the indexes to the first
;; not read and first not written element of `input' and `output'
;; respectively. If there are not enough bytes in `input' to decode a
;; full character, decoding shold be abandomed; the caller has to
;; ensure that the remaining bytes of `input' are passed to the
;; decoder again with more bytes appended.
;;
;; `eof-p' now in turn indicates, if the given input sequence, is all
;; the producer does have and might be used to produce error messages
;; in case of incomplete codes or decided what to do.
;;
;; Decoders are expected to handle the various CR/NL conventions and
;; canonicalize each end of line into a single NL rune (#xA) in good
;; old Lisp tradition.
;;
;; TODO: change this to an encoding class, which then might carry
;; additional state. Stateless encodings could been represented by
;; keywords. e.g.
;;
;; defmethod DECODE-SEQUENCE ((encoding (eql :utf-8)) ...)
;;
;;;; ---------------------------------------------------------------------------
;;;; rod hashtable
;;;;
;;; make-rod-hashtable
;;; rod-hash-get hashtable rod &optional start end -> value ; successp
;;; (setf (rod-hash-get hashtable rod &optional start end) new-value
;;;
(defstruct (rod-hashtable (:constructor make-rod-hashtable/low))
size ;size of table
table ;
)
(defun make-rod-hashtable (&key (size 200))
(setf size (glisp::nearest-greater-prime size))
(make-rod-hashtable/low
:size size
:table (make-array size :initial-element nil)))
(eval-when (compile eval load)
(defconstant +fixnum-bits+
(1- (integer-length most-positive-fixnum))
"Pessimistic approximation of the number of bits of fixnums.")
(defconstant +fixnum-mask+
(1- (expt 2 +fixnum-bits+))
"Pessimistic approximation of the largest bit-mask, still being a fixnum."))
(defsubst stir (a b)
(%and +fixnum-mask+
(%xor (%ior (%ash (%and a #.(ash +fixnum-mask+ -5)) 5)
(%ash a #.(- 5 +fixnum-bits+)))
b)))
(defsubst rod-hash (rod start end)
"Compute a hash code out of a rod."
(let ((res (%- end start)))
(do ((i start (%+ i 1)))
((%= i end))
(declare (type fixnum i))
(setf res (stir res (%rune rod i))))
res))
(defsubst rod=* (x y &key (start1 0) (end1 (length x))
(start2 0) (end2 (length y)))
(and (%= (%- end1 start1) (%- end2 start2))
(do ((i start1 (%+ i 1))
(j start2 (%+ j 1)))
((%= i end1)
t)
(unless (= (%rune x i) (%rune y j))
(return nil)))))
(defsubst rod=** (x y start1 end1 start2 end2)
(and (%= (%- end1 start1) (%- end2 start2))
(do ((i start1 (%+ i 1))
(j start2 (%+ j 1)))
((%= i end1)
t)
(unless (= (%rune x i) (%rune y j))
(return nil)))))
(defun rod-hash-get (hashtable rod &optional (start 0) (end (length rod)))
(declare (type (simple-array rune (*)) rod))
(let ((j (%mod (rod-hash rod start end)
(rod-hashtable-size hashtable))))
(dolist (q (svref (rod-hashtable-table hashtable) j)
(values nil nil nil))
(declare (type cons q))
(when (rod=** (car q) rod 0 (length (the (simple-array rune (*)) (car q))) start end)
(return (values (cdr q) t (car q)))))))
(defun rod-hash-set (new-value hashtable rod &optional (start 0) (end (length rod)))
(let ((j (%mod (rod-hash rod start end)
(rod-hashtable-size hashtable)))
(key nil))
(dolist (q (svref (rod-hashtable-table hashtable) j)
(progn
(setf key (rod-subseq* rod start end))
(push (cons key new-value)
(aref (rod-hashtable-table hashtable) j))))
(when (rod=* (car q) rod :start2 start :end2 end)
(setf key (car q))
(setf (cdr q) new-value)
(return)))
(values new-value key)))
(defun rod-subseq* (source start &optional (end (length source)))
(unless (and (typep start 'fixnum) (>= start 0))
(error "~S is not a non-negative fixnum." start))
(unless (and (typep end 'fixnum) (>= end start))
(error "END argument, ~S, is not a fixnum no less than START, ~S." end start))
(when (> start (length source))
(error "START argument, ~S, should be no greater than length of rod." start))
(when (> end (length source))
(error "END argument, ~S, should be no greater than length of rod." end))
(locally
(declare (type fixnum start end))
(let ((res (make-rod (- end start))))
(declare (type rod res))
(do ((i (- (- end start) 1) (the fixnum (- i 1))))
((< i 0) res)
(declare (type fixnum i))
(setf (%rune res i) (aref source (the fixnum (+ i start))))))))
(deftype ufixnum () `(unsigned-byte ,(integer-length most-positive-fixnum)))
(defun rod-subseq** (source start &optional (end (length source)))
(declare (type (simple-array rune (*)) source)
(type ufixnum start)
(type ufixnum end)
(optimize (speed 3) (safety 0)))
(let ((res (make-array (%- end start) :element-type 'rune)))
(declare (type (simple-array rune (*)) res))
(let ((i (%- end start)))
(declare (type ufixnum i))
(loop
(setf i (- i 1))
(when (= i 0)
(return))
(setf (%rune res i) (%rune source (the ufixnum (+ i start))))))
res))
(defun (setf rod-hash-get) (new-value hashtable rod &optional (start 0) (end (length rod)))
(rod-hash-set new-value hashtable rod start end))
(defparameter *name-hashtable* (make-rod-hashtable :size 2000))
(defun intern-name (rod &optional (start 0) (end (length rod)))
(multiple-value-bind (value successp key) (rod-hash-get *name-hashtable* rod start end)
(declare (ignore value))
(if successp
key
(nth-value 1 (rod-hash-set t *name-hashtable* rod start end)))))
;;;; ---------------------------------------------------------------------------
;;;;
;;;; rod collector
;;;;
(defparameter *scratch-pad*
(make-array 1024 :element-type 'rune))
(defparameter *scratch-pad-2*
(make-array 1024 :element-type 'rune))
(defparameter *scratch-pad-3*
(make-array 1024 :element-type 'rune))
(defparameter *scratch-pad-4*
(make-array 1024 :element-type 'rune))
(declaim (type (simple-array rune (*))
*scratch-pad* *scratch-pad-2* *scratch-pad-3* *scratch-pad-4*))
(defmacro %put-rune (rune-var put)
`(progn
(cond ((%> ,rune-var #xFFFF)
(,put (the (unsigned-byte 16) (%+ #xD7C0 (ash ,rune-var -10))))
(,put (the (unsigned-byte 16) (%ior #xDC00 (%and ,rune-var #x3FF)))))
(t
(,put ,rune-var)))))
(defun adjust-array-by-copying (old-array new-size)
"Adjust an array by copying and thus ensures, that result is a SIMPLE-ARRAY."
(let ((res (make-array new-size :element-type (array-element-type old-array))))
(replace res old-array
:start1 0 :end1 (length old-array)
:start2 0 :end2 (length old-array))
res))
(defmacro with-rune-collector-aux (scratch collect body mode)
(let ((rod (gensym))
(n (gensym))
(i (gensym))
(b (gensym)))
`(let ((,n (length ,scratch))
(,i 0)
(,b ,scratch))
(declare (type fixnum ,n ,i))
(macrolet
((,collect (x)
`((lambda (x)
(locally
(declare #.*fast*)
(when (%>= ,',i ,',n)
(setf ,',n (* 2 ,',n))
(setf ,',b
(setf ,',scratch
(adjust-array-by-copying ,',scratch ,',n))))
(setf (aref (the (simple-array rune (*)) ,',b) ,',i) x)
(incf ,',i)))
,x)))
,@body
,(ecase mode
(:intern
`(intern-name ,b 0 ,i))
(:copy
`(let ((,rod (make-rod ,i)))
(while (not (%= ,i 0))
(setf ,i (%- ,i 1))
(setf (%rune ,rod ,i)
(aref (the (simple-array rune (*)) ,b) ,i)))
,rod))
(:raw
`(values ,b 0 ,i))
)))))
'(defmacro with-rune-collector-aux (scratch collect body mode)
(let ((rod (gensym))
(n (gensym))
(i (gensym))
(b (gensym)))
`(let ((,n (length ,scratch))
(,i 0))
(declare (type fixnum ,n ,i))
(macrolet
((,collect (x)
`((lambda (x)
(locally
(declare #.*fast*)
(when (%>= ,',i ,',n)
(setf ,',n (* 2 ,',n))
(setf ,',scratch
(setf ,',scratch
(adjust-array-by-copying ,',scratch ,',n))))
(setf (aref (the (simple-array rune (*)) ,',scratch) ,',i) x)
(incf ,',i)))
,x)))
,@body
,(ecase mode
(:intern
`(intern-name ,scratch 0 ,i))
(:copy
`(let ((,rod (make-rod ,i)))
(while (%> ,i 0)
(setf ,i (%- ,i 1))
(setf (%rune ,rod ,i)
(aref (the (simple-array rune (*)) ,scratch) ,i)))
,rod))
(:raw
`(values ,scratch 0 ,i))
)))))
(defmacro with-rune-collector ((collect) &body body)
`(with-rune-collector-aux *scratch-pad* ,collect ,body :copy))
(defmacro with-rune-collector-2 ((collect) &body body)
`(with-rune-collector-aux *scratch-pad-2* ,collect ,body :copy))
(defmacro with-rune-collector-3 ((collect) &body body)
`(with-rune-collector-aux *scratch-pad-3* ,collect ,body :copy))
(defmacro with-rune-collector-4 ((collect) &body body)
`(with-rune-collector-aux *scratch-pad-4* ,collect ,body :copy))
(defmacro with-rune-collector/intern ((collect) &body body)
`(with-rune-collector-aux *scratch-pad* ,collect ,body :intern))
(defmacro with-rune-collector/raw ((collect) &body body)
`(with-rune-collector-aux *scratch-pad* ,collect ,body :raw))
#|
(defmacro while-reading-runes ((reader stream-in) &rest body)
;; Thou shalt not leave body via a non local exit
(let ((stream (make-symbol "STREAM"))
(rptr (make-symbol "RPTR"))
(fptr (make-symbol "FPTR"))
(buf (make-symbol "BUF")) )
`(let* ((,stream ,stream-in)
(,rptr (xstream-read-ptr ,stream))
(,fptr (xstream-fill-ptr ,stream))
(,buf (xstream-buffer ,stream)))
(declare (type fixnum ,rptr ,fptr)
(type xstream ,stream))
(macrolet ((,reader (res-var)
`(cond ((%= ,',rptr ,',fptr)
(setf (xstream-read-ptr ,',stream) ,',rptr)
(setf ,res-var (xstream-underflow ,',stream))
(setf ,',rptr (xstream-read-ptr ,',stream))
(setf ,',fptr (xstream-fill-ptr ,',stream))
(setf ,',buf (xstream-buffer ,',stream)))
(t
(setf ,res-var
(aref (the (simple-array read-element (*)) ,',buf)
(the fixnum ,',rptr)))
(setf ,',rptr (%+ ,',rptr 1))))))
(prog1
(let () .,body)
(setf (xstream-read-ptr ,stream) ,rptr) )))))
|#
;;;; ---------------------------------------------------------------------------
;;;; DTD
;;;;
(defparameter *entities* nil)
(defvar *dtd*)
(defun absolute-uri (sysid source-stream)
(setq sysid (rod-string sysid))
(let ((base-sysid
(dolist (k (zstream-input-stack source-stream))
(let ((base-sysid (stream-name-file-name (xstream-name k))))
(when base-sysid (return base-sysid))))))
(assert (not (null base-sysid)))
(merge-sysid sysid base-sysid)))
(defun absolute-extid (source-stream extid)
(case (car extid)
(:system
(list (car extid)
(absolute-uri (cadr extid) source-stream)))
(:public
(list (car extid)
(cadr extid)
(absolute-uri (caddr extid) source-stream)))))
(defun define-entity (source-stream name kind def)
(when (eq (car def) :external)
(setf def
(list (car def) (absolute-extid source-stream (cadr def)))))
(setf name (intern-name name))
(setf *entities*
(append *entities*
(list (cons (list kind name)
def)))))
#||
(defun define-element (zinput dtd element-name content-model)
;; zinput is for source code location recoding
(let ((elmdef (make-elmdef :name element-name
:content content-model
)))
()))
||#
(defun entity->xstream (entity-name kind &optional zstream)
;; `zstream' is for error messages
(let ((looked (assoc (list kind entity-name) *entities* :test #'equal)))
(unless looked
(if zstream
(perror zstream "Entity '~A' is not defined." (rod-string entity-name))
(error "Entity '~A' is not defined." (rod-string entity-name))))
(let (r)
(ecase (cadr looked)
(:internal
(setf r (make-rod-xstream (caddr looked)))
(setf (xstream-name r)
(make-stream-name :entity-name entity-name
:entity-kind kind
:file-name nil)))
(:external
(setf r (open-extid (caddr looked)))
(setf (stream-name-entity-name (xstream-name r)) entity-name
(stream-name-entity-kind (xstream-name r)) kind)))
r)))
(defun entity-source-kind (name type)
(let ((looked (assoc (list type name) *entities* :test #'equal)))
(unless looked
(error "Entity '~A' is not defined." (rod-string name)))
(cadr looked)))
(defun open-extid (extid)
(let ((nam (ecase (car extid)
(:SYSTEM (cadr extid))
(:PUBLIC (caddr extid)))))
(make-xstream (open-sysid nam)
:name (make-stream-name :file-name nam)
:initial-speed 1)))
(defun call-with-entity-expansion-as-stream (zstream cont name kind)
;; `zstream' is for error messages -- we need something better!
(let ((in (entity->xstream name kind zstream)))
(unwind-protect
(funcall cont in)
(close-xstream in))))
(defun define-default-entities ()
(define-entity nil '#.(string-rod "lt") :general `(:internal #.(string-rod "&#60;")))
(define-entity nil '#.(string-rod "gt") :general `(:internal #.(string-rod ">")))
(define-entity nil '#.(string-rod "amp") :general `(:internal #.(string-rod "&#38;")))
(define-entity nil '#.(string-rod "apos") :general `(:internal #.(string-rod "'")))
(define-entity nil '#.(string-rod "quot") :general `(:internal #.(string-rod "\"")))
;;
#||
(define-entity nil '#.(string-rod "ouml") :general `(:internal #.(string-rod "ö")))
(define-entity nil '#.(string-rod "uuml") :general `(:internal #.(string-rod "ü")))
(define-entity nil '#.(string-rod "auml") :general `(:internal #.(string-rod "ä")))
(define-entity nil '#.(string-rod "Ouml") :general `(:internal #.(string-rod "Ö")))
(define-entity nil '#.(string-rod "Auml") :general `(:internal #.(string-rod "Ä")))
(define-entity nil '#.(string-rod "Uuml") :general `(:internal #.(string-rod "Ü")))
(define-entity nil '#.(string-rod "szlig") :general `(:internal #.(string-rod "ß")))
||#
;;
#||
(define-entity nil '#.(string-rod "nbsp")
:general `(:internal ,(let ((r (make-rod 1)))
(setf (aref r 0) #o240)
r)))
||#
)
(defstruct attdef
;; an attribute definition
element ;name of element this attribute belongs to
name ;name of attribute
type ;type of attribute; either one of :CDATA, :ID, :IDREF, :IDREFS,
; :ENTITY, :ENTITIES, :NMTOKEN, :NMTOKENS, or
; (:NOTATION <name>*)
; (:ENUMERATION <name>*)
default) ;default value of attribute:
; :REQUIRED, :IMPLIED, (:FIXED content) or (:DEFAULT content)
(defstruct elmdef
;; an element definition
name ;name of the element
content ;content model
attributes ;list of defined attribtes
defined-p) ;is this element defined? [*]
;; [*] in XML it is possible to define attributes, before the element
;; itself is defined and since we hang attribute definitions into the
;; relevant element definitions, this flag indicates, whether an
;; element was actually defined.
(defstruct dtd
elements ;hashtable or whatnot of all elements
attdefs ;
gentities ;general entities
pentities ;parameter entities
)
;;;;
(defun define-attribute (dtd element name type default)
(let ((adef (make-attdef :element element
:name name
:type type
:default default)))
(cond ((find-attribute dtd element name)
(warn "Attribute \"~A\" of \"~A\" not redefined."
(rod-string name)
(rod-string element)))
(t
(push adef (dtd-attdefs dtd))))))
(defun find-attribute (dtd element name)
(dolist (k (dtd-attdefs dtd))
(cond ((and (eq element (attdef-element k))
(eq name (attdef-name k)))
(return k)))))
(defun map-all-attdefs-for-element (dtd element continuation)
(declare (dynamic-extent continuation));this does not help under ACL
(dolist (k (dtd-attdefs dtd))
(cond ((eq element (attdef-element k))
(funcall continuation k)))))
;;;; ---------------------------------------------------------------------------
;;;; z streams and lexer
;;;;
(defstruct zstream
token-category
token-semantic
input-stack)
(defun read-token (input)
(cond ((zstream-token-category input)
(multiple-value-prog1
(values (zstream-token-category input)
(zstream-token-semantic input))
(setf (zstream-token-category input) nil
(zstream-token-semantic input) nil)))
(t
(read-token-2 input))))
(defun peek-token (input)
(cond ((zstream-token-category input)
(values
(zstream-token-category input)
(zstream-token-semantic input)))
(t
(multiple-value-bind (c s) (read-token input)
(setf (zstream-token-category input) c
(zstream-token-semantic input) s))
(values (zstream-token-category input)
(zstream-token-semantic input)))))
(defun read-token-2 (input)
(cond ((null (zstream-input-stack input))
(values :eof nil))
(t
(let ((c (peek-rune (car (zstream-input-stack input)))))
(cond ((eq c :eof)
(cond ((eq (cadr (zstream-input-stack input)) :stop)
(values :eof nil))
(t
(close-xstream (pop (zstream-input-stack input)))
(if (null (zstream-input-stack input))
(values :eof nil)
(values :S nil) ;fake #x20 after PE expansion
))))
(t
(read-token-3 input)))))))
(defvar *data-behaviour*
) ;either :DTD or :DOC
(defun read-token-3 (zinput)
(let ((input (car (zstream-input-stack zinput))))
;; PI Comment
(let ((c (read-rune input)))
(cond
;; first the common tokens
((rune= #/< c)
(read-token-after-|<| zinput input))
;; now dispatch
(t
(ecase *data-behaviour*
(:DTD
(cond ((rune= #/\[ c) :\[)
((rune= #/\] c) :\])
((rune= #/\( c) :\()
((rune= #/\) c) :\))
((rune= #/\| c) :\|)
((rune= #/\> c) :\>)
((rune= #/\" c) :\")
((rune= #/\' c) :\')
((rune= #/\, c) :\,)
((rune= #/\? c) :\?)
((rune= #/\* c) :\*)
((rune= #/\+ c) :\+)
((name-rune-p c)
(unread-rune c input)
(values :name (read-name-token input)))
((rune= #/# c)
(let ((q (read-name-token input)))
(cond ((equalp q '#.(string-rod "REQUIRED")) :|#REQUIRED|)
((equalp q '#.(string-rod "IMPLIED")) :|#IMPLIED|)
((equalp q '#.(string-rod "FIXED")) :|#FIXED|)
((equalp q '#.(string-rod "PCDATA")) :|#PCDATA|)
(t
(error "Unknown token: ~S." q)))))
((or (rune= c #/U+0020)
(rune= c #/U+0009)
(rune= c #/U+000D)
(rune= c #/U+000A))
(values :s nil))
((rune= #/% c)
(cond ((name-start-rune-p (peek-rune input))
;; an entity reference
(read-pe-reference zinput))
(t
(values :%))))
(t
(error "Unexpected character ~S." c))))
(:DOC
(cond
((rune= c #/&)
(multiple-value-bind (kind data) (read-entity-ref input)
(cond ((eq kind :named)
(values :entity-ref data) )
((eq kind :numeric)
(values :cdata
(with-rune-collector (collect)
(%put-rune data collect)))))))
(t
(unread-rune c input)
(values :cdata (read-cdata input))) ))))))))
(defun read-pe-reference (zinput)
(let* ((input (car (zstream-input-stack zinput)))
(nam (read-name-token input)))
(assert (rune= #/\; (read-rune input)))
(cond (*expand-pe-p*
;; no external entities here!
(let ((i2 (entity->xstream nam :parameter)))
(zstream-push i2 zinput))
(values :S nil) ;space before inserted PE expansion.
)
(t
(values :pe-reference nam)) )))
(defun read-token-after-|<| (zinput input)
(let ((d (read-rune input)))
(cond ((eq d :eof)
(error "EOF after '<'"))
((rune= #/! d)
(read-token-after-|<!| input))
((rune= #/? d)
(multiple-value-bind (target content) (read-pi input)
(cond ((rod= target '#.(string-rod "xml"))
(values :xml-pi (cons target content)))
((rod-equal target '#.(string-rod "XML"))
(error "You lost -- no XML processing instructions."))
((and sax:*namespace-processing* (position #/: target))
(error "Processing instruction target ~S is not a valid NcName."
(mu target)))
(t
(values :pi (cons target content))))))
((rune= #// d)
(let ((c (peek-rune input)))
(cond ((name-start-rune-p c)
(read-tag-2 zinput input :etag))
(t
(error "Expecting name start rune after \"</\".")))))
((name-start-rune-p d)
(unread-rune d input)
(read-tag-2 zinput input :stag))
(t
(error "Expected '!' or '?' after '<' in DTD.")))))
(defun read-token-after-|<!| (input)
(let ((d (read-rune input)))
(cond ((eq d :eof)
(error "EOF after \"<!\"."))
((name-start-rune-p d)
(unread-rune d input)
(let ((name (read-name-token input)))
(cond ((rod= name '#.(string-rod "ELEMENT")) :|<!ELEMENT|)
((rod= name '#.(string-rod "ENTITY")) :|<!ENTITY|)
((rod= name '#.(string-rod "ATTLIST")) :|<!ATTLIST|)
((rod= name '#.(string-rod "NOTATION")) :|<!NOTATION|)
((rod= name '#.(string-rod "DOCTYPE")) :|<!DOCTYPE|)
(t
(error "`<!~A' unknown." (rod-string name))))))
((rune= #/\[ d)
(values :|<![| nil))
((rune= #/- d)
(setf d (read-rune input))
(cond ((rune= #/- d)
(values
:COMMENT
(read-comment-content input)))
(t
(error "Bad character ~S after \"<!-\"" d))))
(t
(error "Bad character ~S after \"<!\"" d)))))
(defun read-attribute-list (zinput input imagine-space-p)
(cond ((or imagine-space-p
(let ((c (peek-rune input)))
(and (not (eq c :eof))
(space-rune-p c))))
(read-S? input)
(cond ((eq (peek-rune input) :eof)
nil)
((name-start-rune-p (peek-rune input))
(cons (read-attribute zinput input)
(read-attribute-list zinput input nil)))
(t
nil)))
(t
nil)))
(defun read-entity-ref (input)
"Read an entity reference off the xstream `input'. Returns two values:
either :NAMED <interned-rod> in case of a named entity
or :NUMERIC <integer> in case of numeric entities.
The initial #\\& is considered to be consumed already."
(let ((c (peek-rune input)))
(cond ((eq c :eof)
(error "EOF after '&'"))
((rune= c #/#)
(values :numeric (read-numeric-entity input)))
(t
(unless (name-start-rune-p (peek-rune input))
(error "Expecting name after &."))
(let ((name (read-name-token input)))
(setf c (read-rune input))
(unless (rune= c #/\;)
(perror input "Expected \";\"."))
(values :named name))))))
(defsubst read-S? (input)
(while (member (peek-rune input) '(#/U+0020 #/U+0009 #/U+000A #/U+000D)
:test #'eq)
(consume-rune input)))
(defun read-tag-2 (zinput input kind)
(let ((name (read-name-token input))
(atts nil))
(setf atts (read-attribute-list zinput input nil))
;;(setf atts (nreverse atts))
;; care for atts
;;
;;zzz
(let ((fn (lambda (adef &aux x)
(setf x (assoc (attdef-name adef) atts))
(when (and (consp (attdef-default adef))
(eq (car (attdef-default adef)) :default)
(not x))
(setf atts (cons (setf x (cons (attdef-name adef) (cadr (attdef-default adef))))
atts)))
(when (and (consp (attdef-default adef))
(eq (car (attdef-default adef)) :fixed)
(not x))
(setf atts (cons (setf x (cons (attdef-name adef) (cadr (attdef-default adef))))
atts)))
(unless (eq (attdef-type adef) :cdata)
(when x
(setf (cdr x) (canon-not-cdata-attval (cdr x)))))
;; xxx more tests
)))
(declare (dynamic-extent fn))
(map-all-attdefs-for-element
*dtd* name fn))
;; check for double attributes
(do ((q atts (cdr q)))
((null q))
(cond ((find (caar q) (cdr q) :key #'car)
(error "Attribute ~S has two definitions in element ~S."
(rod-string (caar q))
(rod-string name)))))
(cond ((eq (peek-rune input) #/>)
(consume-rune input)
(values kind (cons name atts)))
((eq (peek-rune input) #//)
(consume-rune input)
(assert (rune= #/> (read-rune input)))
(values :ztag (cons name atts)))
(t
(error "syntax error in read-tag-2.")) )))
(defun read-attribute (zinput input)
(unless (name-start-rune-p (peek-rune input))
(error "Expected name."))
;; arg thanks to the post mortem nature of name space declarations,
;; we could only process the attribute values post mortem.
(let ((name (read-name-token input)))
(while (let ((c (peek-rune input)))
(and (not (eq c :eof))
(or (= c #/U+0020)
(= c #/U+0009)
(= c #/U+000A)
(= c #/U+000D))))
(consume-rune input))
(unless (eq (read-rune input) #/=)
(perror zinput "Expected \"=\"."))
(while (let ((c (peek-rune input)))
(and (not (eq c :eof))
(or (= c #/U+0020)
(= c #/U+0009)
(= c #/U+000A)
(= c #/U+000D))))
(consume-rune input))
(cons name (read-att-value-2 input))
;;(cons name (read-att-value zinput input :att t))
))
(defun canon-not-cdata-attval (value)
;; | If the declared value is not CDATA, then the XML processor must
;; | further process the normalized attribute value by discarding any
;; | leading and trailing space (#x20) characters, and by replacing
;; | sequences of space (#x20) characters by a single space (#x20)
;; | character.
(with-rune-collector (collect)
(let ((gimme-20 nil)
(anything-seen-p nil))
(map nil (lambda (c)
(cond ((= c #x20)
(setf gimme-20 t))
(t
(when (and anything-seen-p gimme-20)
(collect #x20))
(setf gimme-20 nil)
(setf anything-seen-p t)
(collect c))))
value))))
#||
(defun canon-not-cdata-attval (value)
;; | If the declared value is not CDATA, then the XML processor must
;; | further process the normalized attribute value by discarding any
;; | leading and trailing space (#x20) characters, and by replacing
;; | sequences of space (#x20) characters by a single space (#x20)
;; | character.
value)
||#
(defsubst data-rune-p (c)
;; any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
(or (= c #x9) (= c #xA) (= c #xD)
(<= #x20 c #xD7FF)
(<= #xE000 c #xFFFD)
;;
(<= #xD800 c #xDBFF)
(<= #xDC00 c #xDFFF)
;;
))
#||
(defsubst data-rune-p (c)
t)
||#
(defun read-att-value (zinput input mode &optional canon-space-p (delim nil))
(with-rune-collector-2 (collect)
(labels ((muffle (input delim)
(let (c)
(loop
(setf c (read-rune input))
(cond ((eql delim c)
(return))
((eq c :eof)
(error "EOF"))
((rune= c #/&)
(setf c (peek-rune input))
(cond ((rune= c #/#)
(let ((c (read-numeric-entity input)))
(%put-rune c collect)))
(t
(unless (name-start-rune-p (peek-rune input))
(error "Expecting name after &."))
(let ((name (read-name-token input)))
(setf c (read-rune input))
(assert (rune= c #/\;))
(ecase mode
(:att
(recurse-on-entity
zinput name :general
(lambda (zinput)
(muffle (car (zstream-input-stack zinput))
:eof))))
(:ent
;; bypass, but never the less we
;; need to check for legal
;; syntax.
;; Must it be defined?
;; allerdings: unparsed sind verboten
(collect #/&)
(map nil (lambda (x) (collect x)) name)
(collect #/\; )))))))
((and (eq mode :ent) (rune= c #/%))
(unless (name-start-rune-p (peek-rune input))
(error "Expecting name after %."))
(let ((name (read-name-token input)))
(setf c (read-rune input))
(assert (rune= c #/\;))
(cond (*expand-pe-p*
(recurse-on-entity
zinput name :parameter
(lambda (zinput)
(muffle (car (zstream-input-stack zinput))
:eof))))
(t
(error "No PE here.")))))
((and (eq mode :att) (rune= c #/<))
;; xxx fix error message
(cerror "Eat them in spite of this."
"For no apparent reason #\/< is forbidden in attribute values. ~
You lost -- next time choose SEXPR syntax.")
(collect c))
((and canon-space-p (space-rune-p c))
(collect #/space))
((not (data-rune-p c))
(error "illegal char: ~S." c))
(t
(collect c)))))))
(declare (dynamic-extent #'muffle))
(muffle input (or delim
(let ((delim (read-rune input)))
(assert (member delim '(#/\" #/\')))
delim))))))
(defun read-numeric-entity (input)
;; xxx eof handling
;; The #/& is already read
(let ((res
(let ((c (read-rune input)))
(assert (rune= c #/#))
(setq c (read-rune input))
(cond ((rune= c #/x)
;; hexadecimal
(setq c (read-rune input))
(assert (digit-rune-p c 16))
(prog1
(parse-integer
(with-output-to-string (sink)
(write-char (code-char c) sink)
(while (digit-rune-p (setq c (read-rune input)) 16)
(write-char (code-char c) sink)))
:radix 16)
(assert (rune= c #/\;)))
)
((<= #/0 c #/9)
;; decimal
(prog1
(parse-integer
(with-output-to-string (sink)
(write-char (code-char c) sink)
(while (<= #/0 (setq c (read-rune input)) #/9)
(write-char (code-char c) sink)))
:radix 10)
(assert (rune= c #/\;))) )
(t
(error "Bad char in numeric character entity.") )))))
(unless (data-char-p res)
(error "expansion of numeric character reference (#x~X) is no data char."
res))
res))
(defun read-pi (input)
;; "<?" is already read
(let (name)
(let ((c (peek-rune input)))
(unless (name-start-rune-p c)
(error "Expecting name after '<?'"))
(setf name (read-name-token input)))
(values name
(read-pi-content input))))
(defun read-pi-content (input &aux d)
(read-s? input)
(with-rune-collector (collect)
(block nil
(tagbody
state-1
(setf d (read-rune input))
(unless (data-rune-p d)
(error "Illegal char: ~S." d))
(when (rune= d #/?) (go state-2))
(collect d)
(go state-1)
state-2 ;; #/? seen
(setf d (read-rune input))
(unless (data-rune-p d)
(error "Illegal char: ~S." d))
(when (rune= d #/>) (return))
(when (rune= d #/?)
(collect #/?)
(go state-2))
(collect #/?)
(collect d)
(go state-1)))))
(defun read-comment-content (input &aux d)
(let ((warnedp nil))
(with-rune-collector (collect)
(block nil
(tagbody
state-1
(setf d (read-rune input))
(unless (data-rune-p d)
(error "Illegal char: ~S." d))
(when (rune= d #/-) (go state-2))
(collect d)
(go state-1)
state-2 ;; #/- seen
(setf d (read-rune input))
(unless (data-rune-p d)
(error "Illegal char: ~S." d))
(when (rune= d #/-) (go state-3))
(collect #/-)
(collect d)
(go state-1)
state-3 ;; #/- #/- seen
(setf d (read-rune input))
(unless (data-rune-p d)
(error "Illegal char: ~S." d))
(when (rune= d #/>) (return))
(unless warnedp
(warn "WFC: no '--' in comments please.")
(setf warnedp t))
(when (rune= d #/-)
(collect #/-)
(go state-3))
(collect #/-)
(collect #/-)
(collect d)
(go state-1))))))
(defun read-cdata-sect (input &aux d)
;; <![CDATA[ is already read
;; read anything up to ]]>
(with-rune-collector (collect)
(block nil
(tagbody
state-1
(setf d (read-rune input))
(unless (data-rune-p d)
(error "Illegal char: ~S." d))
(when (rune= d #/\]) (go state-2))
(collect d)
(go state-1)
state-2 ;; #/] seen
(setf d (read-rune input))
(unless (data-rune-p d)
(error "Illegal char: ~S." d))
(when (rune= d #/\]) (go state-3))
(collect #/\])
(collect d)
(go state-1)
state-3 ;; #/\] #/\] seen
(setf d (read-rune input))
(unless (data-rune-p d)
(error "Illegal char: ~S." d))
(when (rune= d #/>)
(return))
(when (rune= d #/\])
(collect #/\])
(go state-3))
(collect #/\])
(collect #/\])
(collect d)
(go state-1)))))
#+(or) ;; FIXME: There is another definition below that looks more reasonable.
(defun read-cdata (input initial-char &aux d)
(cond ((not (data-rune-p initial-char))
(error "Illegal char: ~S." initial-char)))
(with-rune-collector (collect)
(block nil
(tagbody
(cond ((rune= initial-char #/\])
(go state-2))
(t
(collect initial-char)))
state-1
(setf d (peek-rune input))
(when (or (eq d :eof) (rune= d #/<) (rune= d #/&))
(return))
(read-rune input)
(unless (data-rune-p d)
(error "Illegal char: ~S." d))
(when (rune= d #/\]) (go state-2))
(collect d)
(go state-1)
state-2 ;; #/\] seen
(setf d (peek-rune input))
(when (or (eq d :eof) (rune= d #/<) (rune= d #/&))
(collect #/\])
(return))
(read-rune input)
(unless (data-rune-p d)
(error "Illegal char: ~S." d))
(when (rune= d #/\]) (go state-3))
(collect #/\])
(collect d)
(go state-1)
state-3 ;; #/\] #/\] seen
(setf d (peek-rune input))
(when (or (eq d :eof) (rune= d #/<) (rune= d #/&))
(collect #/\])
(collect #/\])
(return))
(read-rune input)
(unless (data-rune-p d)
(error "Illegal char: ~S." d))
(when (rune= d #/>)
(error "For no apparent reason ']]>' in not allowed in a CharData token -- you lost."))
(when (rune= d #/\])
(collect #/\])
(go state-3))
(collect #/\])
(collect #/\])
(collect d)
(go state-1)))))
;; some character categories
#||
(defun name-start-rune-p (rune)
(or (<= #x0041 rune #x005A)
(<= #x0061 rune #x007A)
;; lots more
(>= rune #x0080)
(rune= rune #/_)
(rune= rune #/:)))
(defun name-rune-p (rune)
(or (name-start-rune-p rune)
(rune= rune #/.)
(rune= rune #/-)
(rune<= #/0 rune #/9)))
||#
(defun space-rune-p (rune)
(declare (type rune rune))
(or (rune= rune #/U+0020)
(rune= rune #/U+0009)
(rune= rune #/U+000A)
(rune= rune #/U+000D)))
(defun data-char-p (c)
;; any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
(or (= c #x9) (= c #xA) (= c #xD)
(<= #x20 c #xD7FF)
(<= #xE000 c #xFFFD)
(<= #x10000 c #x10FFFF)))
(defun pubid-char-p (c)
(or (= c #x20) (= c #xD) (= c #xA)
(<= #/a c #/z)
(<= #/A c #/Z)
(<= #/0 c #/9)
(member c '(#/- #/' #/\( #/\) #/+ #/, #/. #//
#/: #/= #/? #/\; #/! #/* #/#
#/@ #/$ #/_ #/%))))
(defun expect (input category)
(multiple-value-bind (cat sem) (read-token input)
(unless (eq cat category)
(error "Expected ~S saw ~S [~S]" category cat sem))
(values cat sem)))
(defun consume-token (input)
(read-token input))
;;;; ---------------------------------------------------------------------------
;;;; Parser
;;;;
(defun p/S (input)
;; S ::= (#x20 | #x9 | #xD | #xA)+
(expect input :S)
(while (eq (peek-token input) :S)
(consume-token input)))
(defun p/S? (input)
;; S ::= (#x20 | #x9 | #xD | #xA)+
(while (eq (peek-token input) :S)
(consume-token input)))
(defun p/name (input)
(nth-value 1 (expect input :name)))
(defun p/attlist-decl (input)
;; [52] AttlistDecl ::= '<!ATTLIST' S Name (S AttDef)* S? '>'
(let (elm-name)
(expect input :|<!ATTLIST|)
(p/S input)
(setf elm-name (p/name input))
(loop
(let ((tok (read-token input)))
(case tok
(:S
(p/S? input)
(cond ((eq (peek-token input) :>)
(consume-token input)
(return))
(t
(multiple-value-bind (name type default) (p/attdef input)
(define-attribute *dtd* elm-name name type default)) )))
(:>
(return))
(otherwise
(error "Expected either another AttDef or end of \"<!ATTLIST\". -- saw ~S."
tok)) )) )))
(defun p/attdef (input)
;; [53] AttDef ::= Name S AttType S DefaultDecl
(let (name type default)
(setf name (p/name input))
(p/S input)
(setf type (p/att-type input))
(p/S input)
(setf default (p/default-decl input))
(values name type default)))
(defun p/list (input item-parser delimiter)
;; Parse something like S? <item> (S? <delimiter> <item>)* S?
;;
(declare (type function item-parser))
(let (res)
(p/S? input)
(setf res (list (funcall item-parser input)))
(loop
(p/S? input)
(cond ((eq (peek-token input) delimiter)
(consume-token input)
(p/S? input)
(push (funcall item-parser input) res))
(t
(return))))
(p/S? input)
(reverse res)))
(defun p/att-type (input)
;; [54] AttType ::= StringType | TokenizedType | EnumeratedType
;; [55] StringType ::= 'CDATA'
;; [56] TokenizedType ::= 'ID' /*VC: ID */
;; /*VC: One ID per Element Type */
;; /*VC: ID Attribute Default */
;; | 'IDREF' /*VC: IDREF */
;; | 'IDREFS' /*VC: IDREF */
;; | 'ENTITY' /*VC: Entity Name */
;; | 'ENTITIES' /*VC: Entity Name */
;; | 'NMTOKEN' /*VC: Name Token */
;; | 'NMTOKENS' /*VC: Name Token */
;; [57] EnumeratedType ::= NotationType | Enumeration
;; [58] NotationType ::= 'NOTATION' S '(' S? Name (S? '|' S? Name)* S? ')'
;; /* VC: Notation Attributes */
;; [59] Enumeration ::= '(' S? Nmtoken (S? '|' S? Nmtoken)* S? ')' /* VC: Enumeration */
(multiple-value-bind (cat sem) (read-token input)
(cond ((eq cat :name)
(cond ((equalp sem '#.(string-rod "CDATA")) :cdata)
((equalp sem '#.(string-rod "ID")) :id)
((equalp sem '#.(string-rod "IDREF")) :idrefs)
((equalp sem '#.(string-rod "IDREFS")) :idrefs)
((equalp sem '#.(string-rod "ENTITY")) :entity)
((equalp sem '#.(string-rod "ENTITIES")) :entities)
((equalp sem '#.(string-rod "NMTOKEN")) :nmtoken)
((equalp sem '#.(string-rod "NMTOKENS")) :nmtokens)
((equalp sem '#.(string-rod "NOTATION"))
;; xxx nmtoken vs name
(let (names)
(p/S input)
(expect input :\()
(setf names (p/list input #'p/name :\| ))
(expect input :\))
(cons :notation names)))
(t
(error "In p/att-type: ~S ~S." cat sem))))
((eq cat :\()
;; xxx nmtoken vs name
(let (names)
;;(expect input :\()
(setf names (p/list input #'p/name :\| ))
(expect input :\))
(cons :enumeration names)))
(t
(error "In p/att-type: ~S ~S." cat sem)) )))
(defun p/default-decl (input)
;; [60] DefaultDecl ::= '#REQUIRED' | '#IMPLIED'
;; | (('#FIXED' S)? AttValue) /* VC: Required Attribute */
;;
;; /* VC: Attribute Default Legal */
;; /* WFC: No < in Attribute Values */
;; /* VC: Fixed Attribute Default */
(multiple-value-bind (cat sem) (peek-token input)
(cond ((eq cat :|#REQUIRED|)
(consume-token input) :required)
((eq cat :|#IMPLIED|)
(consume-token input) :implied)
((eq cat :|#FIXED|)
(consume-token input)
(p/S input)
(list :fixed (p/att-value input)))
((or (eq cat :\') (eq cat :\"))
(list :default (p/att-value input)))
(t
(error "p/default-decl: ~S ~S." cat sem)) )))
;;;;
;; [70] EntityDecl ::= GEDecl | PEDecl
;; [71] GEDecl ::= '<!ENTITY' S Name S EntityDef S? '>'
;; [72] PEDecl ::= '<!ENTITY' S '%' S Name S PEDef S? '>'
;; [73] EntityDef ::= EntityValue | (ExternalID NDataDecl?)
;; [74] PEDef ::= EntityValue | ExternalID
;; [75] ExternalID ::= 'SYSTEM' S SystemLiteral
;; | 'PUBLIC' S PubidLiteral S SystemLiteral
;; [76] NDataDecl ::= S 'NDATA' S Name /* VC: Notation Declared */
(defun p/entity-decl (input)
(let (name def kind)
(expect input :|<!ENTITY|)
(p/S input)
(cond ((eq (peek-token input) :%)
(setf kind :parameter)
(consume-token input)
(p/S input))
(t
(setf kind :general)))
(setf name (p/name input))
(p/S input)
(setf def (p/entity-def input kind))
(define-entity input name kind def)
(p/S? input)
(expect input :\>)))
(defun p/entity-def (input kind)
(multiple-value-bind (cat sem) (peek-token input)
(cond ((member cat '(:\" :\'))
(list :internal (p/entity-value input)))
((and (eq cat :name)
(or (equalp sem '#.(string-rod "SYSTEM"))
(equalp sem '#.(string-rod "PUBLIC"))))
(let (extid ndata)
(setf extid (p/external-id input nil))
(when (eq kind :general) ;NDATA allowed at all?
(cond ((eq (peek-token input) :S)
(p/S? input)
(when (and (eq (peek-token input) :name)
(equalp (nth-value 1 (peek-token input))
'#.(string-rod "NDATA")))
(consume-token input)
(p/S input)
(setf ndata (p/name input))))))
(list :external extid ndata)))
(t
(error "p/entity-def: ~S / ~S." cat sem)) )))
(defun p/entity-value (input)
(let ((delim (if (eq (read-token input) :\") #/\" #/\')))
(read-att-value input
(car (zstream-input-stack input))
:ent
nil
delim)))
(defun p/att-value (input)
(let ((delim (if (eq (read-token input) :\") #/\" #/\')))
(read-att-value input
(car (zstream-input-stack input))
:att
t
delim)))
(defun p/external-id (input &optional (public-only-ok-p nil))
;; xxx public-only-ok-p
(multiple-value-bind (cat sem) (read-token input)
(cond ((and (eq cat :name) (equalp sem '#.(string-rod "SYSTEM")))
(p/S input)
(list :system (p/system-literal input))
)
((and (eq cat :name) (equalp sem '#.(string-rod "PUBLIC")))
(let (pub sys)
(p/S input)
(setf pub (p/pubid-literal input))
(when (eq (peek-token input) :S)
(p/S input)
(when (member (peek-token input) '(:\" :\'))
(setf sys (p/system-literal input))))
(unless (every #'pubid-char-p pub)
(error "Illegal pubid: ~S." (rod-string pub)))
(when (and (not public-only-ok-p)
(null sys))
(error "System identifier needed for this PUBLIC external identifier."))
(list :public pub sys)))
(t
(error "Expected external-id: ~S / ~S." cat sem)))))
;; [11] SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
;; [12] PubidLiteral ::= '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"
;; [13] PubidChar ::= #x20 | #xD | #xA | [a-zA-Z0-9]
;; | [-'()+,./:=?;!*#@$_%]
(defun p/system-literal (input)
(multiple-value-bind (cat) (read-token input)
(cond ((member cat '(:\" :\'))
(let ((delim (if (eq cat :\") #/\" #/\')))
(with-rune-collector (collect)
(loop
(let ((c (read-rune (car (zstream-input-stack input)))))
(cond ((eq c :eof)
(error "EOF in system literal."))
((rune= c delim)
(return))
(t
(collect c))))))))
(t
(error "Expect either \" or \'.")))))
(defun p/pubid-literal (input)
;; xxx check for valid chars
(p/system-literal input))
;;;;
(defun p/element-decl (input)
(let (name content)
(expect input :|<!ELEMENT|)
(p/S input)
(setf name (p/name input))
(p/S input)
(setf content (p/cspec input))
(unless (legal-content-model-p content)
'(error "Illegal content model: ~S." (mu content))
(warn "Illegal content model: ~S." (mu content)))
(p/S? input)
(expect input :\>)
(list :element name content)))
(defun legal-content-model-p (cspec)
(or (eq cspec :PCDATA)
(eq cspec :ANY)
(eq cspec :EMPTY)
(and (consp cspec)
(eq (car cspec) '*)
(consp (cadr cspec))
(eq (car (cadr cspec)) 'or)
(eq (cadr (cadr cspec)) :pcdata)
(every #'vectorp (cddr (cadr cspec))))
(labels ((walk (x)
(cond ((member x '(:PCDATA :ANY :EMPTY))
nil)
((atom x) t)
((and (walk (car x))
(walk (cdr x)))))))
(walk cspec))))
;; wir fahren besser, wenn wir machen:
;; cspec ::= 'EMPTY' | 'ANY' | '#PCDATA'
;; | Name
;; | cs
;; cs ::= '(' S? cspec ( S? '|' S? cs)* S? ')' ('?' | '*' | '+')?
;; und eine post mortem analyse
(defun p/cspec (input &optional (level 0) (only-names-p nil))
(let ((term
(let ((names nil) op-cat op res)
(multiple-value-bind (cat sem) (peek-token input)
(cond ((eq cat :name)
(consume-token input)
(cond ((rod= sem '#.(string-rod "EMPTY"))
:empty)
((rod= sem '#.(string-rod "ANY"))
:any)
(t
sem)))
((and (eq cat :\#PCDATA) (not only-names-p))
(unless (= level 1)
(error "#PCDATA only on top level in content modell."))
(consume-token input)
:pcdata)
((and (eq cat :\() (not only-names-p))
(consume-token input)
(p/S? input)
(setq names (list (p/cspec input (+ level 1))))
(p/S? input)
(let ((on? (eq (car names) :pcdata)))
(cond ((member (peek-token input) '(:\| :\,))
(setf op-cat (peek-token input))
(setf op (if (eq op-cat :\,) 'and 'or))
(while (eq (peek-token input) op-cat)
(consume-token input)
(p/S? input)
(push (p/cspec input (+ level 1) on?) names)
(p/S? input))
(setf res (cons op (reverse names))))
(t
(setf res (car names)))))
(p/S? input)
(expect input :\))
res)
(t
(error "p/cspec - ~s / ~s" cat sem)))))))
(cond ((eq (peek-token input) :?) (consume-token input) (list '? term))
((eq (peek-token input) :+) (consume-token input) (list '+ term))
((eq (peek-token input) :*) (consume-token input) (list '* term))
(t
term))))
;; [52] AttlistDecl ::= '<!ATTLIST' S Name AttDef* S? '>'
;; [52] AttlistDecl ::= '<!ATTLIST' S Name AttDefs S? '>'
;; [52] AttlistDecl ::= '<!ATTLIST' S Name S? '>'
;; [53] AttDefs ::= S Name S AttType S DefaultDecl AttDefs
;; [53] AttDefs ::=
(defun p/notation-decl (input)
(let (name id)
(expect input :|<!NOTATION|)
(p/S input)
(setf name (p/name input))
(p/S input)
(setf id (p/external-id input t))
(p/S? input)
(expect input :\>)
(list :notation-decl name id)))
;;;
(defun p/conditional-sect (input)
(expect input :<!\[ )
(p/S? input)
(multiple-value-bind (cat sem) (read-token input)
(cond ((and (eq cat :name)
(rod= sem '#.(string-rod "INCLUDE")))
(p/include-sect input))
((and (eq cat :name)
(rod= sem '#.(string-rod "IGNORE")))
(p/ignore-sect input))
(t
(error "Expected INCLUDE or IGNORE after \"<![\".")))))
(defun p/include-sect (input)
;; <![INCLUDE is already read.
(p/S? input)
(expect input :\[)
(p/ext-subset-decl input)
(expect input :\])
(expect input :\])
(expect input :\>) )
(defun p/ignore-sect (input)
;; <![IGNORE is already read.
(p/S? input)
(expect input :\[)
(let ((input (car (zstream-input-stack input))))
(let ((level 0))
(do ((c1 (read-rune input) (read-rune input))
(c2 0 c1)
(c3 0 c2))
((= level -1))
(declare (type fixnum level))
(cond ((eq c1 :eof)
(error "EOF in <![IGNORE ... >")))
(cond ((and (rune= c3 #/<) (rune= c2 #/!) (rune= c1 #/\[))
(incf level)))
(cond ((and (rune= c3 #/\]) (rune= c2 #/\]) (rune= c1 #/>))
(decf level))) ))))
(defun p/ext-subset-decl (input)
;; ( markupdecl | conditionalSect | S )*
(loop
(case (let ((*expand-pe-p* nil)) (peek-token input))
(:|<![| (let ((*expand-pe-p* t)) (p/conditional-sect input)))
(:S (consume-token input))
(:eof (return))
((:|<!ELEMENT| :|<!ATTLIST| :|<!ENTITY| :|<!NOTATION| :PI :COMMENT)
(let ((*expand-pe-p* t))
(p/markup-decl input)))
((:pe-reference)
(let ((name (nth-value 1 (read-token input))))
(recurse-on-entity input name :parameter
(lambda (input)
(ecase (entity-source-kind name :parameter)
(:external
(p/ext-subset input))
(:internal
(p/ext-subset-decl input)))
(unless (eq :eof (peek-token input))
(error "Trailing garbage."))))))
(otherwise (return)))) )
(defun p/markup-decl (input)
;; markupdecl ::= elementdecl | AttlistDecl /* VC: Proper Declaration/PE Nesting */
;; | EntityDecl | NotationDecl
;; | PI | Comment /* WFC: PEs in Internal Subset */
(case (peek-token input)
(:|<!ELEMENT| (p/element-decl input))
(:|<!ATTLIST| (p/attlist-decl input))
(:|<!ENTITY| (p/entity-decl input))
(:|<!NOTATION| (p/notation-decl input))
(:PI (consume-token input))
(:COMMENT (consume-token input))
(otherwise
(error "p/markup-decl ~S" (peek-token input)))))
(defun setup-encoding (input xml-header)
(when (xml-header-encoding xml-header)
(let ((enc (find-encoding (xml-header-encoding xml-header))))
(cond (enc
(setf (xstream-encoding (car (zstream-input-stack input)))
enc))
(t
(warn "There is no such encoding: ~S." (xml-header-encoding xml-header)))))))
(defun set-full-speed (input)
(let ((xstream (car (zstream-input-stack input))))
(when xstream
(setf (xstream-speed xstream)
(length (xstream-os-buffer xstream))))
'(warn "Reverting ~S to full speed." input)
))
(defun p/ext-subset (input)
(cond ((eq (peek-token input) :xml-pi)
(let ((hd (parse-xml-pi (cdr (nth-value 1 (peek-token input))) nil)))
(setup-encoding input hd))
(consume-token input)))
(set-full-speed input)
(p/ext-subset-decl input)
(unless (eq (peek-token input) :eof)
(error "Trailing garbage - ~S." (peek-token input))))
(defun p/doctype-decl (input)
(let ((*expand-pe-p* nil))
(let (name extid)
(expect input :|<!DOCTYPE|)
(p/S input)
(setq name (p/name input))
(when (eq (peek-token input) :S)
(p/S input)
(unless (or (eq (peek-token input) :\[ )
(eq (peek-token input) :\> ))
(setf extid (p/external-id input t))))
(p/S? input)
(when (eq (peek-token input) :\[ )
(consume-token input)
(while (progn (p/S? input)
(not (eq (peek-token input) :\] )))
(if (eq (peek-token input) :pe-reference)
(let ((name (nth-value 1 (read-token input))))
(recurse-on-entity input name :parameter
(lambda (input)
(ecase (entity-source-kind name :parameter)
(:external
(p/ext-subset input))
(:internal
(p/ext-subset-decl input)))
(unless (eq :eof (peek-token input))
(error "Trailing garbage.")))))
(p/markup-decl input)))
(consume-token input)
(p/S? input))
(expect input :>)
(when extid
(let* ((xi2 (open-extid (absolute-extid input extid)))
(zi2 (make-zstream :input-stack (list xi2))))
(let ()
(p/ext-subset zi2))))
(list :doctype name extid))))
(defun p/misc*-2 (input)
;; Misc*
(while (member (peek-token input) '(:comment :pi :s))
(when (eq (peek-token input) :pi)
(sax:processing-instruction
*handler*
(car (nth-value 1 (peek-token input)))
(cdr (nth-value 1 (peek-token input)))))
(consume-token input)))
(defvar *handler*)
(defun p/document (input handler)
(let ((*handler* handler)
(*namespace-bindings* *default-namespace-bindings*))
(setf *entities* nil)
(setf *dtd* (make-dtd))
(define-default-entities)
(sax:start-document *handler*)
;; document ::= XMLDecl? Misc* (doctypedecl Misc*)? element Misc*
;; Misc ::= Comment | PI | S
;; xmldecl::='<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
;; sddecl::= S 'standalone' Eq (("'" ('yes' | 'no') "'") | ('"' ('yes' | 'no') '"'))
;;
;; we will use the attribute-value parser for the xml decl.
(let ((*data-behaviour* :DTD))
;; optional XMLDecl?
(cond ((eq (peek-token input) :xml-pi)
(let ((hd (parse-xml-pi (cdr (nth-value 1 (peek-token input))) t)))
(setup-encoding input hd))
;; FIXME: Ceci n'est pas un pi. Should probably go away.
;; (hmot 30/06/03)
(sax:processing-instruction
*handler*
(car (nth-value 1 (peek-token input)))
(cdr (nth-value 1 (peek-token input))))
(read-token input)))
(set-full-speed input)
;; Misc*
(p/misc*-2 input)
;; (doctypedecl Misc*)?
(when (eq (peek-token input) :<!doctype)
(p/doctype-decl input)
(p/misc*-2 input))
;; element
(let ((*data-behaviour* :doc))
(p/element input))
;; optional Misc*
(p/misc*-2 input)
(unless (eq (peek-token input) :eof)
(error "Garbage at end of document."))
(sax:end-document *handler*))))
(defun p/element (input)
(if sax:*namespace-processing*
(p/element-ns input)
(p/element-no-ns input)))
(defun p/element-no-ns (input)
;; [39] element ::= EmptyElemTag | STag content ETag
(multiple-value-bind (cat sem) (read-token input)
(cond ((eq cat :ztag)
(sax:start-element *handler* nil nil (car sem) (build-attribute-list-no-ns (cdr sem)))
(sax:end-element *handler* nil nil (car sem)))
((eq cat :stag)
(sax:start-element *handler* nil nil (car sem) (build-attribute-list-no-ns (cdr sem)))
(p/content input)
(multiple-value-bind (cat2 sem2) (read-token input)
(unless (and (eq cat2 :etag)
(eq (car sem2) (car sem)))
(perror input "Bad nesting. ~S / ~S" (mu sem) (mu (cons cat2 sem2)))))
(sax:end-element *handler* nil nil (car sem)))
(t
(error "Expecting element.")))))
(defun p/element-ns (input)
(destructuring-bind (cat (name &rest attrs))
(multiple-value-list (read-token input))
(let ((ns-decls (declare-namespaces attrs)))
(multiple-value-bind (ns-uri prefix local-name) (decode-qname name)
(declare (ignore prefix))
(let ((attlist (build-attribute-list-ns attrs)))
(cond ((eq cat :ztag)
(sax:start-element *handler* ns-uri local-name name attlist)
(sax:end-element *handler* ns-uri local-name name))
((eq cat :stag)
(sax:start-element *handler* ns-uri local-name name attlist)
(p/content input)
(multiple-value-bind (cat2 sem2) (read-token input)
(unless (and (eq cat2 :etag)
(eq (car sem2) name))
(perror input "Bad nesting. ~S / ~S" (mu name) (mu (cons cat2 sem2)))))
(sax:end-element *handler* ns-uri local-name name))
(t
(error "Expecting element.")))))
(undeclare-namespaces ns-decls))))
(defun perror (stream format-string &rest format-args)
(when (zstream-p stream)
(setf stream (car (zstream-input-stack stream))))
(error "Parse error at line ~D column ~D: ~A"
(xstream-line-number stream)
(xstream-column-number stream)
(apply #'format nil format-string format-args)))
(defun p/content (input)
;; [43] content ::= (element | CharData | Reference | CDSect | PI | Comment)*
(multiple-value-bind (cat sem) (peek-token input)
(case cat
((:stag :ztag)
(p/element input)
(p/content input))
((:cdata)
(consume-token input)
(sax:characters *handler* sem)
(p/content input))
((:entity-ref)
(let ((name sem))
(consume-token input)
(append ;; nil #+(OR)
(recurse-on-entity input name :general
(lambda (input)
(prog1
(ecase (entity-source-kind name :general)
(:internal (p/content input))
(:external (p/ext-parsed-ent input)))
(unless (eq (peek-token input) :eof)
(error "Trailing garbage. - ~S" (peek-token input))))))
(p/content input))))
((:<!\[)
(consume-token input)
(cons
(let ((input (car (zstream-input-stack input))))
(unless (and (rune= #/C (read-rune input))
(rune= #/D (read-rune input))
(rune= #/A (read-rune input))
(rune= #/T (read-rune input))
(rune= #/A (read-rune input))
(rune= #/\[ (read-rune input)))
(error "After '<![', 'CDATA[' is expected."))
(sax:start-cdata *handler*)
(sax:characters *handler* (read-cdata-sect input))
(sax:end-cdata *handler*))
(p/content input)))
((:pi)
(consume-token input)
(sax:processing-instruction *handler* (car sem) (cdr sem))
(p/content input))
((:comment) ;; FIXME: should call sax:comment. How does this work?
(consume-token input)
(p/content input))
(otherwise
nil))))
;; [78] extParsedEnt ::= TextDecl? contentw
;; [79] extPE ::= TextDecl? extSubsetDecl
(defstruct xml-header
version
encoding
(standalone-p nil))
(defun p/ext-parsed-ent (input)
;; [78] extParsedEnt ::= '<?xml' VersionInfo? EncodingDecl S? '?>' content
(when (eq (peek-token input) :xml-pi)
(let ((hd (parse-xml-pi (cdr (nth-value 1 (peek-token input))) nil)))
(setup-encoding input hd))
(consume-token input) )
(set-full-speed input)
(p/content input))
(defun parse-xml-pi (content sd-ok-p)
;; --> xml-header
;;(make-xml-header))
(let* ((res (make-xml-header))
(i (make-rod-xstream content))
(atts (read-attribute-list 'foo i t))) ;xxx on 'foo
(unless (eq (peek-rune i) :eof)
(error "Garbage at end of XML PI."))
;; versioninfo muss da sein
;; dann ? encodingdecl
;; dann ? sddecl
;; dann ende
(when (and (not (eq (caar atts) (intern-name '#.(string-rod "version"))))
sd-ok-p)
(error "XML PI needs version."))
(when (eq (caar atts) (intern-name '#.(string-rod "version")))
(unless (and (>= (length (cdar atts)) 1)
(every (lambda (x)
(or (<= #/a x #/z)
(<= #/A x #/Z)
(<= #/0 x #/9)
(rune= x #/_)
(rune= x #/.)
(rune= x #/:)
(rune= x #/-)))
(cdar atts)))
(error "Bad XML version number: ~S." (rod-string (cdar atts))))
(setf (xml-header-version res) (rod-string (cdar atts)))
(pop atts))
(when (eq (caar atts) (intern-name '#.(string-rod "encoding")))
(unless (and (>= (length (cdar atts)) 1)
(every (lambda (x)
(or (<= #/a x #/z)
(<= #/A x #/Z)
(<= #/0 x #/9)
(rune= x #/_)
(rune= x #/.)
(rune= x #/-)))
(cdar atts))
((lambda (x)
(or (<= #/a x #/z)
(<= #/A x #/Z)
(<= #/0 x #/9)))
(aref (cdar atts) 0)))
(error "Bad XML encoding name: ~S." (rod-string (cdar atts))))
(setf (xml-header-encoding res) (rod-string (cdar atts)))
(pop atts))
(when (and sd-ok-p (eq (caar atts) (intern-name '#.(string-rod "standalone"))))
(unless (or (rod= (cdar atts) '#.(string-rod "yes"))
(rod= (cdar atts) '#.(string-rod "no")))
(error "Hypersensitivity pitfall: ~
XML PI's 'standalone' attribute must be exactly \"yes\" or \"no\" and not ~S."
(rod-string (cdar atts))))
(setf (xml-header-standalone-p res)
(if (rod-equal '#.(string-rod "yes") (cdar atts))
:yes
:no))
(pop atts))
(when atts
(error "XML designers decided to disallow future extensions to the set ~
of allowed XML PI's attributes -- you might have lost big on ~S (~S)"
(rod-string content) sd-ok-p
))
res))
;;;; ---------------------------------------------------------------------------
;;;; mu
;;;;
(defun mu (x)
(cond ((stringp x) x)
((vectorp x) (rod-string x))
((consp x)
(cons (mu (car x)) (mu (cdr x))))
(x)))
;;;; ---------------------------------------------------------------------------
;;;;
;;;; canonical XML according to James Clark
;;;;
;;;; User inteface ;;;;
(defun parse-file (filename &optional (handler (make-instance 'dom-impl::dom-builder)))
(with-open-xstream (input filename)
(setf (xstream-name input)
(make-stream-name
:entity-name "main document"
:entity-kind :main
:file-name filename))
(let ((zstream (make-zstream :input-stack (list input))))
(peek-rune input)
(progn 'time
(p/document zstream handler)))))
(defun parse-stream (stream &optional (handler (make-instance 'dom-impl::dom-builder)))
(let* ((xstream
(make-xstream
stream
:name (make-stream-name
:entity-name "main document"
:entity-kind :main
:file-name (or (ignore-errors (pathname *standard-output*))
*default-pathname-defaults*))
:initial-speed 1))
(zstream (make-zstream :input-stack (list xstream))))
(p/document zstream handler)))
(defun parse-string (string &optional (handler (make-instance 'dom-impl::dom-builder)))
(let* ((x (string->xstream string))
(z (make-zstream :input-stack (list x))))
(p/document z handler)))
(defun string->xstream (string)
(make-rod-xstream (string-rod string)))
;;;;
#+ALLEGRO
(defmacro sp (&body body)
`(progn
(prof:with-profiling (:type :space) .,body)
(prof:show-flat-profile)))
#+ALLEGRO
(defmacro tm (&body body)
`(progn
(prof:with-profiling (:type :time) .,body)
(prof:show-flat-profile)))
;;;;
(defun zstream-push (new-xstream zstream)
(cond ((find-if (lambda (x)
(and (xstream-p x)
(eql (stream-name-entity-name (xstream-name x))
(stream-name-entity-name (xstream-name new-xstream)))
(eql (stream-name-entity-kind (xstream-name x))
(stream-name-entity-kind (xstream-name new-xstream)))))
(zstream-input-stack zstream))
(error "Infinite recursion.")))
(push new-xstream (zstream-input-stack zstream))
zstream)
(defun recurse-on-entity (zstream name kind continuation)
(assert (not (zstream-token-category zstream)))
;;(sleep .2)
;;(warn "~S / ~S[~S]." (zstream-input-stack zstream) (mu name) kind)
(call-with-entity-expansion-as-stream
zstream
(lambda (new-xstream)
(push :stop (zstream-input-stack zstream))
(zstream-push new-xstream zstream)
(prog1
(funcall continuation zstream)
(assert (eq (peek-token zstream) :eof))
(assert (eq (pop (zstream-input-stack zstream)) new-xstream))
(close-xstream new-xstream)
(assert (eq (pop (zstream-input-stack zstream)) :stop))
(setf (zstream-token-category zstream) nil)
'(consume-token zstream)) )
name kind))
(defun merge-sysid (sysid base)
(merge-pathnames sysid base))
(defun open-sysid (sysid)
(open sysid :element-type '(unsigned-byte 8) :direction :input))
;;;;
(defparameter *test-files*
'(;;"jclark:xmltest;not-wf;*;*.xml"
"jclark:xmltest;valid;*;*.xml"
;;"jclark:xmltest;invalid;*.xml"
))
(defun run-all-tests (&optional (test-files *test-files*))
(let ((failed nil))
(dolist (k test-files)
(dolist (j (sort (directory k) #'string< :key #'pathname-name))
(unless (test-file j)
(push j failed))))
(fresh-line)
(cond (failed
(write-string "**** Test failed on")
(dolist (k failed)
(format t "~%**** ~S." k))
nil)
(t
(write-string "**** Test passed!")
t))))
(defun test-file (filename)
(let ((out-filename (merge-pathnames "out/" filename)))
(if (probe-file out-filename)
(positive-test-file filename out-filename)
(negative-test-file filename))))
(defun positive-test-file (filename out-filename)
(multiple-value-bind (nodes condition)
(ignore-errors (parse-file filename))
(cond (condition
(warn "**** Error in ~S: ~A." filename condition)
nil)
(t
(let (res equal?)
(setf res (with-output-to-string (sink)
(unparse-document nodes sink)))
(setf equal?
(with-open-file (in out-filename :direction :input :element-type 'character)
(do ((i 0 (+ i 1))
(c (read-char in nil nil) (read-char in nil nil)))
((or (eq c nil) (= i (length res)))
(and (eq c nil) (= i (length res))))
(unless (eql c (char res i))
(return nil)))))
(cond ((not equal?)
(format t "~&**** Test failed on ~S." filename)
(fresh-line)
(format t "** me: ~A" res)
(fresh-line)
(format t "** he: " res)
(finish-output)
(with-open-file (in out-filename :direction :input :element-type 'character)
(do ((c (read-char in nil nil) (read-char in nil nil)))
((eq c nil))
(write-char c)))
nil)
(t
t)))))))
(defun negative-test-file (filename)
(multiple-value-bind (nodes condition)
(ignore-errors (parse-file filename))
(declare (ignore nodes))
(cond (condition
t)
(t
(warn "**** negative test failed on ~S." filename)))))
;;;;
(progn
(defmethod dom:create-processing-instruction ((document null) target data)
(declare (ignorable document target data))
nil)
(defmethod dom:append-child ((node null) child)
(declare (ignorable node child))
nil)
(defmethod dom:create-element ((document null) name)
(declare (ignorable document name))
nil)
(defmethod dom:set-attribute ((document null) name value)
(declare (ignorable document name value))
nil)
(defmethod dom:create-text-node ((document null) data)
(declare (ignorable document data))
nil)
(defmethod dom:create-cdata-section ((document null) data)
(declare (ignorable document data))
nil)
)
;;; Implementation of a simple but faster DOM.
(defclass simple-document ()
((children :initform nil :accessor simple-document-children)))
(defstruct node
parent)
(defstruct (processing-instruction (:include node))
target
data)
(defstruct (text (:include node)
(:constructor make-text-boa (parent data)))
data)
(defstruct (element (:include node))
gi
attributes
children)
(defmethod dom:create-processing-instruction ((document simple-document) target data)
(make-processing-instruction :target target :data data))
(defmethod dom:append-child ((node element) child)
(setf (node-parent child) node)
(push child (element-children node)))
(defmethod dom:append-child ((node simple-document) child)
(push child (simple-document-children node))
nil)
(defmethod dom:create-element ((document simple-document) name)
(make-element :gi name))
(defmethod dom:set-attribute ((node element) name value)
(push (cons name value)
(element-attributes node)))
(defmethod dom:create-text-node ((document simple-document) data)
(make-text-boa nil data))
(defmethod dom:create-cdata-section ((document simple-document) data)
(make-text-boa nil data))
#||
(defmacro read-data-until* ((predicate input res res-start res-end) &body body)
;; fast variant -- for now disabled for no apparent reason
;; -> res, res-start, res-end
`(let* ((rptr (xstream-read-ptr ,input))
(p0 rptr)
(fptr (xstream-fill-ptr ,input))
(buf (xstream-buffer ,input))
,res ,res-start ,res-end)
(declare (type fixnum rptr fptr p0)
(type (simple-array read-element (*)) buf))
(loop
(cond ((%= rptr fptr)
;; underflow -- hmm inject the scratch-pad with what we
;; read and continue, while using read-rune and collecting
;; d.h. besser wäre hier auch while-reading zu benutzen.
(setf (xstream-read-ptr ,input) rptr)
(multiple-value-setq (,res ,res-start ,res-end)
(with-rune-collector/raw (collect)
(do ((i p0 (%+ i 1)))
((%= i rptr))
(collect (%rune buf i)))
(let (c)
(loop
(cond ((%= rptr fptr)
(setf (xstream-read-ptr ,input) rptr)
(setf c (peek-rune input))
(cond ((eq c :eof)
(return)))
(setf rptr (xstream-read-ptr ,input)
fptr (xstream-fill-ptr ,input)
buf (xstream-buffer ,input)))
(t
(setf c (%rune buf rptr))))
(cond ((,predicate c)
;; we stop
(setf (xstream-read-ptr ,input) rptr)
(return))
(t
;; we continue
(collect c)
(setf rptr (%+ rptr 1))) )))))
(return))
((,predicate (%rune buf rptr))
;; we stop
(setf (xstream-read-ptr ,input) rptr)
(setf ,res buf ,res-start p0 ,res-end rptr)
(return) )
(t
;; we continue
(setf rptr (%+ rptr 1))) ))
,@body ))
||#
;(defun read-data-until (predicate input continuation)
; )
(defmacro read-data-until* ((predicate input res res-start res-end) &body body)
"Read data from `input' until `predicate' applied to the read char
turns true. Then execute `body' with `res', `res-start', `res-end'
bound to denote a subsequence (of RUNEs) containing the read portion.
The rune upon which `predicate' turned true is neither consumed from
the stream, nor included in `res'.
Keep the predicate short, this it may be included more than once into
the macro's expansion."
;;
(let ((input-var (gensym))
(collect (gensym))
(c (gensym)))
`(LET ((,input-var ,input))
(MULTIPLE-VALUE-BIND (,res ,res-start ,res-end)
(WITH-RUNE-COLLECTOR/RAW (,collect)
(LOOP
(LET ((,c (PEEK-RUNE ,input-var)))
(COND ((EQ ,c :EOF)
;; xxx error message
(RETURN))
((FUNCALL ,predicate ,c)
(RETURN))
(t
(,collect ,c)
(CONSUME-RUNE ,input-var))))))
(LOCALLY
,@body)))))
(defun read-name-token (input)
(read-data-until* ((lambda (rune)
(declare (type rune rune))
(not (name-rune-p rune)))
input
r rs re)
(intern-name r rs re)))
(defun read-cdata (input)
(read-data-until* ((lambda (rune)
(declare (type rune rune))
(or (%= rune #/<) (%= rune #/&)))
input
source start end)
(locally
(declare (type (simple-array rune (*)) source)
(type ufixnum start)
(type ufixnum end)
(optimize (speed 3) (safety 0)))
(let ((res (make-array (%- end start) :element-type 'rune)))
(declare (type (simple-array rune (*)) res))
(let ((i (%- end start)))
(declare (type ufixnum i))
(loop
(setf i (- i 1))
(setf (%rune res i) (%rune source (the ufixnum (+ i start))))
(when (= i 0)
(return))))
res))))
(defun internal-entity-expansion (name)
(let ((e (assoc (list :general name) *entities* :test #'equal)))
(unless e
(error "Entity '~A' is not defined." (rod-string name)))
(unless (eq :internal (cadr e))
(error "Entity '~A' is not an internal entity."))
(or (cadddr e)
(car
(setf (cdddr e)
(cons (find-internal-entity-expansion name) nil))))))
(defun find-internal-entity-expansion (name)
(let ((zinput (make-zstream)))
(with-rune-collector-3 (collect)
(labels ((muffle (input)
(let (c)
(loop
(setf c (read-rune input))
(cond ((eq c :eof)
(return))
((rune= c #/&)
(setf c (peek-rune input))
(cond ((rune= c #/#)
(let ((c (read-numeric-entity input)))
(%put-rune c collect)))
(t
(unless (name-start-rune-p (peek-rune input))
(error "Expecting name after &."))
(let ((name (read-name-token input)))
(setf c (read-rune input))
(assert (rune= c #/\;))
(recurse-on-entity
zinput name :general
(lambda (zinput)
(muffle (car (zstream-input-stack zinput)))))))))
((and (rune= c #/<))
;; xxx fix error message
(cerror "Eat them in spite of this."
"For no apparent reason #\/< is forbidden in attribute values. ~
You lost -- next time choose SEXPR syntax.")
(collect c))
((space-rune-p c)
(collect #/space))
((not (data-rune-p c))
(error "illegal char: ~S." c))
(t
(collect c)))))))
(declare (dynamic-extent #'muffle))
(recurse-on-entity
zinput name :general
(lambda (zinput)
(muffle (car (zstream-input-stack zinput))))) ))))
#+(or) ;; Do we need this? Not called anywhere
(defun ff (name)
(let ((input (make-zstream)))
(let ((*data-behaviour* :doc)
(*document* (make-instance 'simple-document)))
(recurse-on-entity
input name :general
(lambda (input)
(prog1
(ecase (entity-source-kind name :general)
(:internal (p/content input))
(:external (p/ext-parsed-ent input)))
(unless (eq (peek-token input) :eof)
(error "Trailing garbage. - ~S" (peek-token input)))))))))
(defun read-att-value-2 (input)
(let ((delim (read-rune input)))
(unless (member delim '(#/\" #/\') :test #'eql)
(error "Bad attribute value delimiter ~S, must be either #\\\" or #\\\'."
(if (< delim char-code-limit) (code-char delim) delim)))
(with-rune-collector-4 (collect)
(loop
(let ((c (read-rune input)))
(cond ((eq c :eof)
(error "EOF"))
((rune= c delim)
(return))
((rune= #/& c)
(multiple-value-bind (kind sem) (read-entity-ref input)
(ecase kind
(:numeric
(%put-rune sem collect))
(:named
(let* ((exp (internal-entity-expansion sem))
(n (length exp)))
(declare (type (simple-array rune (*)) exp))
(do ((i 0 (%+ i 1)))
((%= i n))
(collect (%rune exp i))))))))
((space-rune-p c)
(collect #x20))
(t
(collect c))))))))
;;;;;;;;;;;;;;;;;
;;; Namespace stuff
(defvar *namespace-bindings* ())
(defvar *default-namespace-bindings*
'((#"" . nil)
(#"xmlns" . #"http://www.w3.org/2000/xmlns/")
(#"xml" . #"http://www.w3.org/XML/1998/namespace")))
;; We already know that name is part of a valid XML name, so all we
;; have to check is that the first rune is a name-start-rune and that
;; there is not colon in it.
(defun nc-name-p (name)
(and (name-start-rune-p (rune name 0))
(notany #'(lambda (rune) (rune= #/: rune)) name)))
(defun split-qname (qname)
(declare (type glisp:simple-rod qname))
(let ((pos (position #/: qname)))
(if pos
(let ((prefix (subseq qname 0 pos))
(local-name (subseq qname (1+ pos))))
(if (nc-name-p local-name)
(values prefix local-name)
(error "~S is not a valid NcName." local-name)))
(values () qname))))
(defun decode-qname (qname)
"decode-qname name => namespace-uri, prefix, local-name"
(declare (type glisp:simple-rod qname))
(multiple-value-bind (prefix local-name) (split-qname qname)
(let ((uri (find-namespace-binding prefix)))
(if uri
(values uri prefix local-name)
(values nil nil nil)))))
(defun find-namespace-binding (prefix)
(cdr (or (assoc prefix *namespace-bindings* :test #'rod=)
(error "Undeclared namespace prefix: ~A" (rod-string prefix)))))
;; FIXME: Should probably be refactored by adding :start and :end to rod=/rod-equal
(defun rod-starts-with (prefix rod)
(and (<= (length prefix) (length rod))
(dotimes (i (length prefix) t)
(unless (rune= (rune prefix i) (rune rod i))
(return nil)))))
(defun xmlns-attr-p (attr-name)
(rod-starts-with #.(string-rod "xmlns") attr-name))
(defun attrname->prefix (attrname)
(if (< 5 (length attrname))
(subseq attrname 6)
nil))
(defun find-namespace-declarations (attr-alist)
(mapcar #'(lambda (attr)
(cons (attrname->prefix (car attr)) (cdr attr)))
(remove-if-not #'xmlns-attr-p attr-alist :key #'car)))
(defun declare-namespaces (attr-alist)
(let ((ns-decls (find-namespace-declarations attr-alist)))
(dolist (ns-decl ns-decls )
;; check some namespace validity constraints
;; FIXME: Would be nice to add "this is insane, go ahead" restarts
(let ((prefix (car ns-decl))
(uri (if (rod= #"" (cdr ns-decl))
nil
(cdr ns-decl))))
(cond
((and (rod= prefix #"xml")
(not (rod= uri #"http://www.w3.org/XML/1998/namespace")))
(error "Attempt to rebind the prefix \"xml\" to ~S." (mu uri)))
((and (rod= uri #"http://www.w3.org/XML/1998/namespace")
(not (rod= prefix #"xml")))
(error "The namespace URI \"http://www.w3.org/XML/1998/namespace\" ~
may not be bound to the prefix ~S, only \"xml\" is legal."
(mu prefix)))
((and (rod= prefix #"xmlns")
(rod= uri #"http://www.w3.org/2000/xmlns/"))
(error "Attempt to bind the prefix \"xmlns\" to its predefined ~
URI \"http://www.w3.org/2000/xmlns/\", which is ~
forbidden for no good reason."))
((rod= prefix #"xmlns")
(error "Attempt to bind the prefix \"xmlns\" to the URI ~S, ~
but it may not be declared." (mu uri)))
((rod= uri #"http://www.w3.org/2000/xmlns/")
(error "The namespace URI \"http://www.w3.org/2000/xmlns/\" may ~
not be bound to prefix ~S (or any other)." (mu prefix)))
((and (rod= uri #"") prefix)
(error "Only the default namespace (the one without a prefix) may ~
be bound to an empty namespace URI, thus undeclaring it."))
(t
(push (cons prefix uri) *namespace-bindings*)
(sax:start-prefix-mapping *handler* (car ns-decl) (cdr ns-decl))))))
ns-decls))
(defun undeclare-namespaces (ns-decls)
(dolist (ns-decl ns-decls)
(setq *namespace-bindings* (delete ns-decl *namespace-bindings*))
(sax:end-prefix-mapping *handler* (car ns-decl))))
(defstruct attribute
namespace-uri
local-name
qname
value)
(defun build-attribute-list-no-ns (attr-alist)
(mapcar #'(lambda (pair) (make-attribute :qname (car pair) :value (cdr pair)))
attr-alist))
;; FIXME: Use a non-braindead way to enforce attribute uniqueness
(defun build-attribute-list-ns (attr-alist)
(let (attributes)
(dolist (pair attr-alist)
(when (or (not (xmlns-attr-p (car pair)))
sax:*include-xmlns-attributes*)
(push (build-attribute (car pair) (cdr pair)) attributes)))
;; 5.3 Uniqueness of Attributes
;; In XML documents conforming to [the xmlns] specification, no
;; tag may contain two attributes which:
;; 1. have identical names, or
;; 2. have qualified names with the same local part and with
;; prefixes which have been bound to namespace names that are
;; identical.
;;
;; 1. is checked by read-tag-2, so we only deal with 2 here
(do ((sublist attributes (cdr sublist)))
((null sublist) attributes)
(let ((attr-1 (car sublist)))
(when (and (attribute-namespace-uri attr-1)
(find-if #'(lambda (attr-2)
(and (rod= (attribute-namespace-uri attr-1)
(attribute-namespace-uri attr-2))
(rod= (attribute-local-name attr-1)
(attribute-local-name attr-2))))
(cdr sublist)))
(error "Multiple definitions of attribute ~S in namespace ~S."
(mu (attribute-local-name attr-1))
(mu (attribute-namespace-uri attr-1))))))))
(defun build-attribute (name value)
(multiple-value-bind (prefix local-name) (split-qname name)
(declare (ignorable local-name))
(if (or (not prefix) ;; default namespace doesn't apply to attributes
(and (rod= #"xmlns" prefix) (not sax:*use-xmlns-namespace*)))
(make-attribute :qname name :value value)
(multiple-value-bind (uri prefix local-name)
(decode-qname name)
(declare (ignore prefix))
(make-attribute :qname name
:value value
:namespace-uri uri
:local-name local-name)))))
;;; Faster constructors
;; Since using the general DOM interface to construct the parsed trees
;; may turn out to be quite expensive (That depends on the underlying
;; DOM implementation). A particular DOM implementation may choose to
;; implement an XML:FAST-CONSTRUCTORS method:
;; XML:FAST-CONSTRUCTORS document [method]
;;
;; Return an alist of constructors suitable for the document `document'.
;;
;; (:MAKE-TEXT document parent data)
;; (:MAKE-PROCESSING-INSTRUCTION document parent target content)
;; (:MAKE-NODE document parent attributes content)
;; [`attributes' now in turn is an alist]
;; (:MAKE-CDATA document parent data)
;; (:MAKE-COMMENT document parent data)
;;
;;;;;;;;;;;;;;;;;
;; System Identifier Protocol
;; A system identifier is an object obeying to the system identifier
;; protocol. Often something like an URL or a pathname.
;; OPEN-SYS-ID sys-id [generic function]
;;
;; Opens the resource associated with the system identifier `sys-id'
;; for reading and returns a stream. For now it is expected, that the
;; stream is an octet stream (one of element type (unsigned-byte 8)).
;;
;; More precisely: The returned object only has to obey to the xstream
;; controller protocol. (That is it has to provide implementations for
;; READ-OCTETS and XSTREAM-CONTROLLER-CLOSE).
;; MERGE-SYS-ID sys-id base [generic function]
;;
;; Merges two system identifiers. That is resolve `sys-id' relative to
;; `base' yielding an absolute system identifier suitable for
;; OPEN-SYS-ID.
;; xstream Controller Protocol
;;
;;
#||
(defun xml-parse (system-id &key document standalone-p)
)
||#