Wednesday, June 3, 2009

XML Infoset Models

The XML Information Set (Infoset) defines a number of classes that can be used to model XML documents, and to some extent this is exactly the model used in other W3C specifications. However, some specifications use fewer or more classes than this set, and it is instructive to compare them in order to find the common subset and the exhaustive superset. In the comparison below, "Info" means XML Infoset and "Path" means the XPath/XQuery Data Model. The abbreviations are Y = yes/required, N = no/unspecified, and O = optional.

Class Name             XML  DOM  EXI  Info  Path
CharacterData          Y    Y    Y    Y     N
Comment                Y    Y    O    Y     Y
DocumentFragment       N    Y    Y    N     N
DocumentType           Y    O    O    Y     N
Entity                 Y    O    O    Y     N
EntityReference        Y    O    O    Y     N
Notation               Y    O    N    Y     N
ProcessingInstruction  Y    O    O    Y     Y

Technically, XML Infoset does not include a CharacterData class, but it does include a Character class; a list of Character items would be isomorphic to CharacterData. DOM only requires the optional classes if hasFeature("XML", "1.0") returns true, so for all XML DOM implementations these classes are not optional. Also, EXI requires the Namespace class, even though preserving the namespace prefix strings is optional. XPath has the smallest model, which only supports 7 of the above classes.
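As a rough illustration of the hasFeature("XML", "1.0") check, here is a minimal sketch that uses Python's standard-library xml.dom.minidom as a stand-in DOM implementation. The choice of minidom is an assumption made purely for illustration; it is not one of the models compared above, and the element name "root" and the node contents are made up.

```python
from xml.dom import Node
from xml.dom.minidom import getDOMImplementation

# minidom is used here only as a convenient example DOM implementation.
impl = getDOMImplementation()

# Ask the implementation whether it claims support for the XML 1.0 feature.
supports_xml = impl.hasFeature("XML", "1.0")

# Build a tiny document ("root" is a hypothetical element name) and create
# two of the node classes from the comparison table.
doc = impl.createDocument(None, "root", None)
comment = doc.createComment("an example comment")
pi = doc.createProcessingInstruction("target", "data")

print("XML 1.0 feature reported:", supports_xml)
print("comment is a Comment node:", comment.nodeType == Node.COMMENT_NODE)
print("pi is a ProcessingInstruction node:",
      pi.nodeType == Node.PROCESSING_INSTRUCTION_NODE)
```

A DOM that reports false for this feature would, by the reasoning above, be free to leave out the classes marked O in the DOM column.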

It is also interesting to note that Comments, Namespaces and ProcessingInstructions are not forbidden by any of the standards considered above. Also, as you can see, the only classes required by all W3C models are Attributes, Elements, Documents, Namespaces and Text. Why all the bloat?
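The common-subset reasoning can be checked mechanically. The sketch below encodes only the rows that appear in the comparison table in this post, so classes mentioned in the prose but not listed there (such as Attributes or Namespaces) are not covered; the helper name classes_where is hypothetical.

```python
# Column order follows the table header: XML, DOM, EXI, Info, Path.
COLUMNS = ("XML", "DOM", "EXI", "Info", "Path")

# Rows copied from the table; Y = required, N = no/unspecified, O = optional.
TABLE = {
    "CharacterData": "YYYYN",
    "Comment": "YYOYY",
    "DocumentFragment": "NYYNN",
    "DocumentType": "YOOYN",
    "Entity": "YOOYN",
    "EntityReference": "YOOYN",
    "Notation": "YONYN",
    "ProcessingInstruction": "YOOYY",
}


def classes_where(predicate):
    """Return the sorted class names whose flag string satisfies predicate."""
    return sorted(name for name, flags in TABLE.items() if predicate(flags))


# Common subset: classes every model marks as required.
required_by_all = classes_where(lambda flags: set(flags) == {"Y"})

# Classes that no model marks as N (required or optional everywhere).
never_forbidden = classes_where(lambda flags: "N" not in flags)

# Per-model view: classes each model marks as Y or O.
supported_by = {
    column: [name for name, flags in TABLE.items() if flags[i] != "N"]
    for i, column in enumerate(COLUMNS)
}

print("required by all:", required_by_all)
print("never forbidden:", never_forbidden)
for column, names in supported_by.items():
    print(column, len(names), names)
```

Among the listed rows, none is required by every model, and only Comment and ProcessingInstruction are never marked N, which agrees with the observation above for those two classes.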

1 comment:

  1. It's not really productive to call something bloat if you don't see it in all of some arbitrarily-chosen list of features.

    XML itself certainly has the notion of partial documents (they are external entities), so I'm not sure what your first column represents.

    The XDM doesn't require implementations to make information available about external entities, but it does require that they track base URIs. "Character data" (CDATA) is not distinguished from text in the XDM, because it's all been unescaped.

    Most of the features are there for historical reasons though, or for compatibility with SGML-based systems (including the Web and HTML).