Tuesday, April 28, 2009

XML - Entity References

When we write a web page, we often want to use all sorts of characters in our text, such as exclamation points (!), ampersands (&), and less-than signs (<). Sooner or later, it becomes obvious that we need a way to put a character into a page without typing the character itself. For example, the less-than sign (<) begins XML markup, so it cannot appear literally in text. To remedy this, the entity reference &lt; is used instead. It expands to < when the document is parsed, so the desired effect is achieved. The ampersand has the same problem, since it begins every entity reference, and is written &amp;.
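As a small illustration of that expansion, here is a sketch using Python's standard library. The sample element and its text are made up for the example. It only shows the predefined references being expanded on parsing and restored on escaping.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The markup uses entity references where the literal characters
# would be mistaken for markup.
markup = "<p>1 &lt; 2 &amp;&amp; 3 &gt; 2</p>"

# Parsing expands the references into the characters they stand for.
text = ET.fromstring(markup).text
print(text)

# Escaping goes the other way, turning &, < and > back into references.
escaped = escape(text)
print(escaped)
```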

The concept of XML entities was borrowed from SGML, so their inner workings are very similar. Another borrowing from SGML is the DOCTYPE declaration. While backwards compatibility was an important objective for XML 1.0, it is a low priority now that XML processors are ubiquitous and rarely depend on more general SGML processing. In fact, now that XML Schema has replaced almost every aspect of the Document Type Definition (DTD), it is likely that the use of DTDs will diminish. Since XML Schema provides a richer environment for declaring elements and defining types, having to use DTDs as well feels like a burden. This naturally leads to the question of whether DTDs can be removed from the XML specification.

Suppose DTDs were removed from the XML specification (perhaps for XML 1.5). What else would have to change in order to maintain consistency? Surprisingly, only entities. Everything else can be declared equivalently (or better) in XML Schema, which also provides rich types that can tighten validity constraints in ways DTDs cannot. However, XML Schema does not provide a way to define entities. So a new mechanism for defining entities would be required, and it could be one of the following options:

  • Use a simplified DOCTYPE declaration for compatibility.
  • Create a new processing instruction to include entities.
  • Let the user agent handle entities however it wants.
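For comparison, the first option would presumably look much like what XML 1.0 already allows: a DOCTYPE whose internal subset holds nothing but entity declarations. The sketch below assumes exactly that form. The `note` element and the `blog` entity are invented for the example, and the parser is Python's standard one, which expands internal entities declared this way.

```python
import xml.etree.ElementTree as ET

# A DOCTYPE reduced to entity declarations only (no element or
# attribute declarations). Names here are illustrative.
document = """<!DOCTYPE note [
  <!ENTITY copy "&#169;">
  <!ENTITY blog "XML Notes">
]>
<note>&copy; 2009 &blog;</note>"""

note = ET.fromstring(document)
print(note.text)
```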

Of these options, the new processing instruction would be the most consistent with W3C's other specifications, such as Associating Style Sheets with XML documents (which defines the xml-stylesheet processing instruction). It would also be a chance to codify current best practice. XML allows entities to be defined internally (within the document) and externally (in another file). Most documents that use entities either depend on a standard list of entities (like the HTML entities) or directly include a file of entity definitions. So while a hypothetical processing instruction <?xml-entity copy "&#169;"?> would do the job, it would go against that practice by defining entities one at a time inside the document. A hypothetical processing instruction <?xml-entities href="htmllat1.ent"?> would be more in line with current usage. Since this would introduce a new language for the external file, a specification of this idea would need at least two parts:

  • A grammar for the external entity file:
    entSubset  ::= (GEDecl | Comment | S)*
  • A grammar for the processing instruction:
    EntitiesPI ::= '<?xml-entities' (S PseudoAtt)* S? '?>'
    where the allowed pseudo-attributes are:
    href    CDATA #REQUIRED
    charset CDATA #IMPLIED
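To make the processing instruction grammar concrete, here is a minimal sketch of a parser for it. Since xml-entities is only a proposal, everything here is hypothetical, including the function name parse_entities_pi. The regular expressions are a simplification of the PseudoAtt production, and the sketch assumes that href and charset may both appear, in either order.

```python
import re

# One pseudo-attribute: name = "value" or name = 'value'.
ATT = r"[A-Za-z][\w.-]*\s*=\s*(?:\"[^\"<]*\"|'[^'<]*')"
PI_RE = re.compile(r"<\?xml-entities((?:\s+" + ATT + r")*)\s*\?>")
ATT_RE = re.compile(r"([A-Za-z][\w.-]*)\s*=\s*(?:\"([^\"<]*)\"|'([^'<]*)')")

ALLOWED = {"href", "charset"}


def parse_entities_pi(pi):
    """Return the pseudo-attributes of a hypothetical xml-entities PI."""
    match = PI_RE.fullmatch(pi.strip())
    if match is None:
        raise ValueError("not a well-formed xml-entities instruction")
    atts = {}
    for name, double_quoted, single_quoted in ATT_RE.findall(match.group(1)):
        atts[name] = double_quoted or single_quoted
    unknown = set(atts) - ALLOWED
    if unknown:
        raise ValueError("unknown pseudo-attributes: " + ", ".join(sorted(unknown)))
    if "href" not in atts:
        raise ValueError("href is #REQUIRED")
    return atts


print(parse_entities_pi('<?xml-entities href="htmllat1.ent"?>'))
```

Rejecting unknown pseudo-attributes is a design choice of this sketch. A real specification might prefer to ignore them for forward compatibility.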

Another benefit of using a processing instruction instead of some other method is that you do not have to wait until a new XML processor is implemented. In the worst case, you could pass the document through a tool that understands the processing instruction, and the rest can be handled by a normal XML processor. Thus, you can live in a DTD-free world without waiting for XML 1.5. You can have it your way, today.
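Such a tool could be quite small. The following is a rough sketch, not a finished implementation: the names expand and load_entities are hypothetical, the entity file is supplied through a caller-provided fetch function rather than read from disk, and only the href pseudo-attribute is handled. It works on raw text with regular expressions, so it does not skip comments or CDATA sections, and it does not expand entities nested inside other entity values.

```python
import re
import xml.etree.ElementTree as ET

PI_RE = re.compile(r"<\?xml-entities\s+href=(\"[^\"]*\"|'[^']*')\s*\?>")
DECL_RE = re.compile(r"<!ENTITY\s+([A-Za-z_:][\w.:-]*)\s+(\"[^\"]*\"|'[^']*')\s*>")
REF_RE = re.compile(r"&([A-Za-z_:][\w.:-]*);")
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}


def load_entities(ent_text):
    """Collect name -> replacement text from general entity declarations."""
    return {name: value[1:-1] for name, value in DECL_RE.findall(ent_text)}


def expand(document, fetch):
    """Remove the xml-entities PI and expand the references it defines."""
    match = PI_RE.search(document)
    if match is None:
        return document  # nothing to do
    entities = load_entities(fetch(match.group(1)[1:-1]))
    body = document[:match.start()] + document[match.end():]

    def replace(ref):
        name = ref.group(1)
        if name in PREDEFINED or name not in entities:
            return ref.group(0)  # leave it for the XML processor
        return entities[name]

    return REF_RE.sub(replace, body)


# Stand-in for the file system: a made-up two-line entity file.
FILES = {"htmllat1.ent": '<!ENTITY copy "&#169;">\n<!ENTITY nbsp "&#160;">'}
source = '<?xml-entities href="htmllat1.ent"?><p>&copy;&nbsp;2009 &amp; later</p>'

plain = expand(source, FILES.__getitem__)
root = ET.fromstring(plain)  # an ordinary XML processor takes over from here
print(plain)
```

The output of expand contains only character references and predefined entities, so it needs no DOCTYPE at all.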