04 August 2007

XML-4: Declarations

Disclaimer: I am not an expert on this topic, but a student. I am hoping that these notes will be of help or interest to others trying to understand what XML is and how it works. My notes are not as well-composed as I would like, and I've been more interested in correcting errors than poor composition. Moreover, there's a lot of repetition that I've been trying to prune back in subsequent revisions.

From my previous post on XML:
Another important component of XML documents is the document type definition (DTD), which includes a list of the elements required in a valid document; or, if there are multiple kinds of documents defined, specifies what must be included in each. DTDs are not the same thing as style sheets that define the tags; DTDs define the elements that exist, what attributes are allowed, and what entities are accepted. A valid XML document has this declaration, and conforms to it. The DTD is sometimes included within the page it describes, but more frequently it makes sense for the DTD to be a file (*.dtd) that serves the whole domain. Examples of DTD code can be found here.

(Some of this passage may be missing in the revised version)
All elements need to be declared, or formally listed, in the document type definition (DTD). A DTD may be included with the source document, or it may be a separate file. A DTD includes a long list of statements like <!ELEMENT PROGRAM ANY>, which in this case declares the element "PROGRAM." While that's necessary for a source document to be valid, and [therefore] readable to an XSL compiler, declarations require somewhat more.

The declaration above does not specify what type of data is allowed to belong to the element PROGRAM. Or rather, it does; "ANY" means anything. Ideally, however, an element is declared with the child elements it requires; for example, PROGRAM might have (PERFORMANCE, REVIEW) as its two child elements, with
<!ELEMENT PERFORMANCE (DATE, WORK, PERFORMER)>
declaring that PERFORMANCE includes the child elements of DATE, WORK, and PERFORMER. An element with no children declares its data type as (#PCDATA), i.e., it contains only parsed character data (no tags).
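
To see these pieces together, here is a minimal sketch of a complete document that carries its DTD in the internal subset. The declarations for PROGRAM and PERFORMANCE follow the text above; treating DATE, WORK, PERFORMER, and REVIEW as plain (#PCDATA) elements is my own assumption for illustration, as are all the sample values.

```xml
<?xml version="1.0"?>
<!DOCTYPE PROGRAM [
  <!-- PROGRAM holds exactly one PERFORMANCE followed by one REVIEW -->
  <!ELEMENT PROGRAM (PERFORMANCE, REVIEW)>
  <!ELEMENT PERFORMANCE (DATE, WORK, PERFORMER)>
  <!-- Elements with no children: parsed character data only (assumed here) -->
  <!ELEMENT DATE (#PCDATA)>
  <!ELEMENT WORK (#PCDATA)>
  <!ELEMENT PERFORMER (#PCDATA)>
  <!ELEMENT REVIEW (#PCDATA)>
]>
<PROGRAM>
  <PERFORMANCE>
    <DATE>2007-08-04</DATE>
    <WORK>Sample Work</WORK>
    <PERFORMER>Sample Performer</PERFORMER>
  </PERFORMANCE>
  <REVIEW>Sample review text.</REVIEW>
</PROGRAM>
```

A validating parser should accept this as written, and should reject it if, say, the REVIEW element were removed or the children of PERFORMANCE were put in a different order.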

The content model above calls for exactly one of each listed element, in the order given, e.g.,
<!ELEMENT PERFORMANCE (DATE, WORK, PERFORMER)>
means each PERFORMANCE has one DATE, one WORK, and one PERFORMER.
<!ELEMENT PERFORMANCE (DATE, WORK+, PERFORMER+)>
requires one DATE but allows one or more of each WORK and PERFORMER, and
<!ELEMENT PERFORMANCE (DATE, WORK?, PERFORMER+)>
allows for events in which there may be no WORK (as, for example, an appearance by a famous comedian) but one or more performers. Conversely, you might want to specify that there could be anywhere from zero to many of a child element; then one would declare
<!ELEMENT PERFORMER (CONDUCTOR*, VOCALIST*, INSTRUMENTALIST*, SPEAKER*)>
Otherwise, if the number of each child element is fixed—say, at four—then you include the named element four times in the declaration:
<!ELEMENT SEASON (PERFORMANCE, PERFORMANCE, PERFORMANCE, PERFORMANCE)>
One can also specify that a whole group of elements repeats together, so that each time one of them occurs it is accompanied by the others:
<!ELEMENT PERFORMANCE (DATE, WORK, PERFORMER, REVIEW)*>
Here's a different way of associating the same elements, using the vertical bar, which means a choice rather than a sequence:
<!ELEMENT PERFORMANCE (DATE | WORK | PERFORMER | REVIEW)+>
This version allows the listed elements in any order and in any number, but there must be at least one of them.
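
As a rough illustration of the occurrence markers, the sketch below uses the (DATE, WORK?, PERFORMER+) declaration from above and shows two PERFORMANCE elements that both satisfy it. The SEASON (PERFORMANCE+) declaration, the (#PCDATA) leaf elements, and the sample values are assumptions I've made so the example is complete.

```xml
<?xml version="1.0"?>
<!DOCTYPE SEASON [
  <!-- Assumed for this sketch: a season is one or more performances -->
  <!ELEMENT SEASON (PERFORMANCE+)>
  <!-- One DATE, then zero or one WORK, then one or more PERFORMER, in that order -->
  <!ELEMENT PERFORMANCE (DATE, WORK?, PERFORMER+)>
  <!ELEMENT DATE (#PCDATA)>
  <!ELEMENT WORK (#PCDATA)>
  <!ELEMENT PERFORMER (#PCDATA)>
]>
<SEASON>
  <!-- A concert: one work, two performers -->
  <PERFORMANCE>
    <DATE>2007-08-04</DATE>
    <WORK>Sample Work</WORK>
    <PERFORMER>First Performer</PERFORMER>
    <PERFORMER>Second Performer</PERFORMER>
  </PERFORMANCE>
  <!-- A comedian's appearance: no WORK at all -->
  <PERFORMANCE>
    <DATE>2007-08-05</DATE>
    <PERFORMER>Sample Comedian</PERFORMER>
  </PERFORMANCE>
</SEASON>
```

A PERFORMANCE with two WORK elements, or with no PERFORMER, would not be valid against this declaration.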

At this point it might be useful to illustrate once more the concept of the tree. I've edited the image a bit.

I've sort of vandalized the original so that you can see the idea of how XPath, the path language that XSL relies on, navigates the elements and their children in order to render XML source documents.

Turning away from the hierarchical nature of elements, there are three other concepts that need declarations spelled out. One is empty elements, such as images and line breaks.
<!ELEMENT IMG EMPTY>
The second is element attributes, or adjectives for elements. Element attributes are pretty simple. Here's an example I actually used in this very blog post:
<IMG SRC="http://farm2~.jpg" HEIGHT="200px" WIDTH="550px" ALT="XML Element Path" /> 
I truncated the image source attribute, just because it's so long. Images have to have a source, alternative text, and height; they may require alignment and padding. In XML, attributes need to be declared:
<!ELEMENT IMG EMPTY>
<!ATTLIST IMG ID ID #REQUIRED>
<!ATTLIST IMG SRC CDATA #REQUIRED>
<!ATTLIST IMG HEIGHT CDATA #REQUIRED>
<!ATTLIST IMG WIDTH CDATA #REQUIRED>
<!ATTLIST IMG ALIGN CDATA #REQUIRED>
<!ATTLIST IMG ALT CDATA #REQUIRED>
The ID and CDATA indicate the type of attribute data allowed; here's a list of acceptable attribute types and their meaning. The keyword #REQUIRED means that all instances of IMG have to state ID, HEIGHT, and so forth. The declaration may replace #REQUIRED with a quoted value, such as "center" (for alignment), which specifies the default value for that attribute. In other cases, you may wish for the author of source documents to specify such-and-such an attribute, but you don't want to mandate it. The #IMPLIED keyword allows an attribute to be declared without requiring it. The #FIXED "*" form pegs the attribute at *: the author may leave it out, but a document that supplies any other value is not valid.
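
Here is a small sketch that puts the four kinds of attribute default side by side. The GALLERY wrapper, the UNITS attribute, and the file names are hypothetical, invented only to make the example a complete document; the keywords themselves come from the XML 1.0 specification.

```xml
<?xml version="1.0"?>
<!DOCTYPE GALLERY [
  <!ELEMENT GALLERY (IMG+)>
  <!ELEMENT IMG EMPTY>
  <!-- SRC and ALT must always be given; ID is optional;
       ALIGN defaults to "center"; UNITS can only ever be "px" -->
  <!ATTLIST IMG
    SRC   CDATA #REQUIRED
    ALT   CDATA #REQUIRED
    ID    ID    #IMPLIED
    ALIGN CDATA "center"
    UNITS CDATA #FIXED "px">
]>
<GALLERY>
  <!-- Only the required attributes: ALIGN and UNITS take their declared values -->
  <IMG SRC="first.jpg" ALT="First image"/>
  <!-- Everything spelled out; UNITS must match the fixed value to stay valid -->
  <IMG SRC="second.jpg" ALT="Second image" ID="img2" ALIGN="left" UNITS="px"/>
</GALLERY>
```

Writing UNITS="em" on either IMG, or leaving out ALT, would make the document invalid.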

A third is the declaration of entities, which are essentially surrogates for data, analogous to variables or "insert x here" memos in a manuscript. Entities are [usually] invoked with an ampersand (&) and closed with a semicolon (;), e.g., &HEADER; In some cases, as with predefined general entities (like &lt; for "<"), it's possible to avoid declaring them entirely under most conditions. The less trivial case of the internal general entity is fairly simple to declare:
<!ENTITY AUTHOR "James R MacLean"> 
A single change of the declaration allows one to replace my name with that of a new author. Or one can include a complex pattern of chapter headings complete with lines, images, and references to elements.
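
As a minimal sketch of how that declaration gets used, the document below declares AUTHOR once and refers to it twice. The PROGRAM and REVIEW declarations and the sentence inside REVIEW are assumptions of mine, added only so the example stands on its own.

```xml
<?xml version="1.0"?>
<!DOCTYPE PROGRAM [
  <!ELEMENT PROGRAM (REVIEW)>
  <!ELEMENT REVIEW (#PCDATA)>
  <!-- Internal general entity: the replacement text lives in the DTD itself -->
  <!ENTITY AUTHOR "James R MacLean">
]>
<PROGRAM>
  <REVIEW>Review written by &AUTHOR;. All opinions are those of &AUTHOR;.</REVIEW>
</PROGRAM>
```

When a parser reads this, each &AUTHOR; is replaced by the declared text, so changing the one declaration changes both places in the review.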

External general entities are a little more complicated to declare since they must include the SYSTEM keyword and a Uniform Resource Identifier (URI), which in practice is usually a URL naming the actual resource (e.g., a file or an image).
<!ENTITY HEADER SYSTEM "URI">
One much-cited advantage to this is that it allows a web designer to generate a web page with source files located in multiple locations. Another is that it allows feeds, like Atom & RSS.
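
A sketch of what this might look like in a full document follows. The address http://www.example.com/header.xml is a placeholder, and I'm assuming that file contains a single TITLE element; the PAGE and TITLE declarations are likewise invented for the example. Note the SYSTEM keyword, which is what marks the entity as external rather than internal.

```xml
<?xml version="1.0"?>
<!DOCTYPE PAGE [
  <!-- Mixed content: text and TITLE elements, in any order -->
  <!ELEMENT PAGE (#PCDATA | TITLE)*>
  <!ELEMENT TITLE (#PCDATA)>
  <!-- External general entity: the replacement text is fetched from the URI -->
  <!ENTITY HEADER SYSTEM "http://www.example.com/header.xml">
]>
<PAGE>&HEADER; Body text follows the shared header.</PAGE>
```

Whether &HEADER; is actually fetched depends on the parser; many parsers leave external entities unresolved by default, so this is worth testing before relying on it.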

Internal parameter entities are entities that are used in the declaration itself; they are handy for standardizing elements by allowing all of them to have certain parts of their declaration changed at once. Just as an internal parameter entity is invoked with a % rather than an &, so the declaration differs from that of a general entity by using a %.
<!ENTITY % PERFORMANCES_CONTENT "(DATE, (WORK_TITLE| COMPOSER| PERFORMER)+)"> 
The above allows one to alter the required child elements for every element whose declaration uses the reference %PERFORMANCES_CONTENT; as its content model.
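
Here's a sketch of how that might be used. I've written it as the contents of a separate, hypothetical file called performances.dtd, because XML 1.0 only permits a parameter entity reference inside another declaration when the DTD is external. The element names CONCERT and RECITAL are my own inventions.

```xml
<!-- performances.dtd: a hypothetical external DTD file -->
<!ENTITY % PERFORMANCES_CONTENT "(DATE, (WORK_TITLE | COMPOSER | PERFORMER)+)">

<!-- Both elements share the content model held in the parameter entity -->
<!ELEMENT CONCERT %PERFORMANCES_CONTENT;>
<!ELEMENT RECITAL %PERFORMANCES_CONTENT;>

<!ELEMENT DATE (#PCDATA)>
<!ELEMENT WORK_TITLE (#PCDATA)>
<!ELEMENT COMPOSER (#PCDATA)>
<!ELEMENT PERFORMER (#PCDATA)>
```

Editing the one ENTITY line changes what both CONCERT and RECITAL are required to contain.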

OBSERVATION: While XML doesn't define elements, and it's possible for you to create elements like "SNOOKUM" and "UTTER_HUMBUG," defined with the most eccentric characteristics imaginable, the format and terms for declarations are established. There are some limitations to XML's tabula rasa approach. You can't declare "GROOVINESS" as the attribute type for an attribute of the EMPTY element WANKRIDER, although I suppose you could submit it to the W3C for serious consideration.

REFERENCES: Airi Salminen, Frank Wm. Tompa, "Requirements for XML Document Database Systems" (PDF- Nov 2001), esp. p.5; W3C, "Extensible Markup Language (XML) 1.0" (1998)

Elliotte Rusty Harold, XML 1.1 Bible, 3rd Edition, Wiley Publishing (2004); see §8: "Element Declarations" and §9: "Attribute Declarations"
