19 July 2007


(Part 1)

Disclaimer—I am not an expert in this topic, but a student. I am hoping that my notes may be useful to others researching the subject. Those of you who notice errors, please feel free to either contact me or leave a comment below.

The Extensible Markup Language (XML) is a markup language for describing structured data. It is used for web pages and, in some cases, for GUI displays in non-web applications, among many other things. XML is a very commonly used tool for building domain-specific languages, because it allows a domain, such as a website, to define its own terms. The same syntax, in another domain, will carry different definitions and hence produce different results. By contrast, the Hypertext Markup Language (HTML), used for coding simple static web pages, has a fixed vocabulary: its tags mean the same thing wherever you use them.

One of the peculiarities of XML that makes it so attractive for all manner of GUI views is that programmers define the tags. So, for example, suppose you are creating a chart that displays information about classical music concerts. You need to include the name of the work, the composer, the librettist (if there is one), the performing ensemble, the conductor, soloists, venue, date, and sponsor. You might also need to include the price of tickets. You could have a tag marked <work>, another marked <composer>, and so on. You would then have a stylesheet specifying how each tag should be displayed.
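Here is a minimal sketch of such a concert record, read with Python's standard xml.etree.ElementTree module. The tag names and the sample values (including "Example Hall") are my own inventions for illustration; XML itself defines none of them.

```python
import xml.etree.ElementTree as ET

# Hypothetical concert listing. Every tag name below was chosen by the
# document's author, not by XML.
CONCERT_XML = """
<concert>
  <work>The Magic Flute</work>
  <composer>Wolfgang Amadeus Mozart</composer>
  <librettist>Emanuel Schikaneder</librettist>
  <venue>Example Hall</venue>
  <date>2007-07-19</date>
</concert>
"""

def read_concert(xml_text):
    """Parse one <concert> record into a dict of tag name -> text."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

concert = read_concert(CONCERT_XML)
print(concert["composer"])
```

The parser needs no advance knowledge of <work> or <composer>; it simply reports whatever tags the document uses.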

Another advantage is, naturally, that a computerized filing system can identify the data automatically. You may think that a different typeface for each of those items is too busy and visually displeasing, but of course you can define the tags to all look the same; the program will still be able to search for those items. A further advantage is that some fields, like chemistry or music, have peculiar systems of notation in their publications, and XML lets each field define tags suited to its own notation. As one might expect, there are implementations of XML for specific trades: XHTML, RSS, MathML, GraphML, Scalable Vector Graphics, and MusicXML. The first one, XHTML (Extensible HTML), recasts HTML under XML's rules. XHTML has more demanding rules than HTML does, but because every XHTML document is well-formed XML, it can be handled by any XML parser without the error-recovery guesswork that HTML requires.
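To illustrate the point about searching, here is a small sketch in Python. The <season> document and the helper name works_by are assumptions of mine, made up for the example; the search depends only on the tag names, not on how any tag is displayed.

```python
import xml.etree.ElementTree as ET

# Hypothetical season listing; tag names and contents are invented.
SEASON_XML = """
<season>
  <concert><work>Symphony No. 9</work><composer>Beethoven</composer></concert>
  <concert><work>Messiah</work><composer>Handel</composer></concert>
  <concert><work>Fidelio</work><composer>Beethoven</composer></concert>
</season>
"""

def works_by(xml_text, composer_name):
    """Return the titles of all works whose <composer> matches."""
    root = ET.fromstring(xml_text)
    return [
        concert.findtext("work")
        for concert in root.findall("concert")
        if concert.findtext("composer") == composer_name
    ]

print(works_by(SEASON_XML, "Beethoven"))
```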

When all the tags and their meanings have been defined, we say the programmer has created an XML application. This part has already mentioned some commonly used applications; XHTML is the best known, but there are many others, some standardized by the W3C, and presumably tens of thousands created by organizations for proprietary use. Creating a new XML application is scarcely more difficult than locating an existing one that suits a particular organization, especially since XSL exists to transform the result into something the browser can display.

XML does impose some rules about how data is represented. These come in two kinds: well-formedness and validity. Well-formedness rules govern how elements and attributes are written (XML predefines none of them). The motivation comes from HTML: when an HTML file contains errors, the browser tries to work out what the programmer meant to say. This encourages wildcat HTML, in which coders write HTML based on how one browser happens to display it. Other browsers are designed to display it a different way, so the page does not display properly there. XML nips this wildcat forking in the bud by insisting that malformed pages trigger an error.
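The contrast can be sketched with two parsers from Python's standard library. Note the assumptions: html.parser is only a tokenizer, not a browser, so it stands in loosely for "a forgiving HTML reader"; the BROKEN sample and the helper names are mine.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

BROKEN = "<p><b>bold text</p>"  # the <b> tag is never closed

class TagCollector(HTMLParser):
    """Records every opening tag the HTML parser reports."""
    def __init__(self):
        super().__init__()
        self.opened = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

def html_tags_seen(markup):
    """Feed markup to the lenient HTML parser; it does not complain."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.opened

def xml_error(markup):
    """Return the XML parser's complaint, or None if there is none."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        return str(err)
    return None

print(html_tags_seen(BROKEN))  # the HTML side carries on regardless
print(xml_error(BROKEN))       # the XML side reports an error
```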

Well-formedness rules specify what tags are legal. All tags must be closed; tags are case-sensitive; and tags must be closed in the reverse of the order in which they were opened. The rules for correct coding in XML are much more exacting than in HTML. This rigor means that an XML parser never has to guess at the author's intent, so programs can process XML documents reliably as data.
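Each of the three rules above can be tried out against a real parser. This is a rough sketch using Python's standard library; the sample snippets and the name is_well_formed are my own.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """True if the XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True

# One sample that obeys the rules, and one violation of each rule.
SAMPLES = {
    "properly nested": "<note><to>Ann</to></note>",
    "unclosed tag": "<note><to>Ann</note>",
    "case mismatch": "<Note>hello</note>",
    "wrong closing order": "<b><i>text</b></i>",
}

results = {label: is_well_formed(text) for label, text in SAMPLES.items()}
for label, ok in results.items():
    print(label, "->", "accepted" if ok else "rejected")
```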

Markup languages are about more than merely tags; there are also elements, which are the basic building blocks of a document. In HTML, it's possible to create a document with almost no structure at all; for example, one could simply open Notepad, type
<html>Hello World! </html>
and save as .html, and that could be a web page. You could even include a few tags, e.g.,
<font color="#9966FF"></font>
and the result would be a perfectly splendid web page. XML is less forgiving. Everything in the document must belong to an element, all elements must sit inside a single root element, and there is a hierarchy of elements. An element may be a listing of some kind, or the body of text, or footnote text, or a title, or a salutation. An element is opened by a tag and must always be closed by one, unless it's an empty element (in which case the tag ends with a closing slash, thus: <hr/>). Incidentally, those using Blogger may have noticed, in the Edit Html view, that <img src="*"> is always changed to <img src="*" /> after switching views or after saving; the reason is that Blogger is programmed to convert HTML 4 code to XHTML.
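As a small illustration of elements nested under one root, and of how an empty element is written out, here is a sketch in Python. The element names (letter, salutation, body) are assumptions chosen to echo the examples in the text.

```python
import xml.etree.ElementTree as ET

def build_letter():
    """Build a tiny document: one root, two text elements, one empty element."""
    root = ET.Element("letter")
    ET.SubElement(root, "salutation").text = "Dear reader,"
    ET.SubElement(root, "body").text = "Hello World!"
    ET.SubElement(root, "hr")  # an empty element: no text, no children
    return root

letter_xml = ET.tostring(build_letter(), encoding="unicode")
print(letter_xml)
```

The serializer writes the empty element with a closing slash, much as Blogger does when it rewrites an img tag.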

XML elements have to take a hierarchical form with a single root.

Elements, like objects, have to be organized into a single hierarchy, with each document element nested within higher-order elements. Attributes are the distinguishing features of elements; they are written inside the element's opening tag, with their values enclosed in quotation marks (e.g., <font size="2">). For certain elements, such as pictures, some attributes have to be defined. XML defines no elements, but XML applications do.
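A quick sketch of how a parser sees attributes, and of what happens when the quotation marks are left off. The sample markup (including the file name flute.png) is invented for illustration.

```python
import xml.etree.ElementTree as ET

def attributes_of(markup):
    """Return the attributes of the root element as a plain dict."""
    return dict(ET.fromstring(markup).attrib)

def parses(markup):
    """True if the markup is accepted by the XML parser."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True

quoted = '<img src="flute.png" alt="A flute"/>'
unquoted = "<font size=2>text</font>"  # tolerated by HTML browsers, not by XML

print(attributes_of(quoted))
print(parses(unquoted))
```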

Entities are named units of storage in XML. Conceivably, the entire document may be an entity, including associated files defining the XML elements. However, this is a trivial example of an entity. Internal entities are defined in the document and may include something as simple as a symbol, or perhaps a footer. External entities include a uniform resource identifier (URI), which identifies precisely where the content of the entity is found. In some cases, the advantage of using an entity is that it may be used as a variable; changing the value in one place changes it everywhere it appears in the resulting document. Also, internal parameter entities can be used in the associated files to change what is a legal element.
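The "entity as a variable" idea can be sketched as follows. The entity name footer and its text are my own assumptions; the document declares the entity once in its internal subset and refers to it twice, and Python's standard parser expands each reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with one internal entity, &footer;, used in two places.
PAGE_XML = """<!DOCTYPE page [
  <!ENTITY footer "Copyright 2007, Example Press">
]>
<page>
  <article>First article. &footer;</article>
  <article>Second article. &footer;</article>
</page>"""

def article_texts(xml_text):
    """Return the text of every <article>, with entities already expanded."""
    root = ET.fromstring(xml_text)
    return [article.text for article in root.findall("article")]

for text in article_texts(PAGE_XML):
    print(text)
```

Changing the one ENTITY line would change both articles at once.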

Another important component of XML documents is the document type definition (DTD), which lists the elements permitted in a valid document and how they may be arranged; or, if there are multiple kinds of documents defined, specifies what must be included in each. DTDs are not the same thing as the style sheets that say how tags are displayed; DTDs define the elements that exist, what attributes are allowed, and what entities are accepted. A valid XML document names its DTD in a document type declaration and conforms to it. The DTD is sometimes included within the page it describes, but more frequently it makes sense for the DTD to be a separate file (*.dtd) that serves the whole domain.
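Python's standard library does not include a validating parser, so what follows is only a partial sketch of the idea, not real DTD validation. It uses the expat parser to collect the element names declared in an internal DTD and then reports any element the document uses without declaring. The sample document and the function name undeclared_elements are assumptions of mine.

```python
import xml.parsers.expat

# Hypothetical document with its DTD included in the page itself.
# The <encore> element is deliberately missing from the declarations.
DOC = """<?xml version="1.0"?>
<!DOCTYPE concert [
  <!ELEMENT concert (work, composer)>
  <!ELEMENT work (#PCDATA)>
  <!ELEMENT composer (#PCDATA)>
]>
<concert>
  <work>Symphony No. 5</work>
  <composer>Beethoven</composer>
  <encore>unlisted</encore>
</concert>"""

def undeclared_elements(document):
    """List elements that appear in the document but not in its DTD."""
    declared = set()
    used = []

    def on_element_decl(name, model):
        declared.add(name)   # called once per <!ELEMENT ...> declaration

    def on_start_element(name, attrs):
        used.append(name)    # called once per opening tag in the body

    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = on_element_decl
    parser.StartElementHandler = on_start_element
    parser.Parse(document, True)
    return [name for name in used if name not in declared]

print(undeclared_elements(DOC))
```

A true validating parser would also check the order and number of children and the attributes; this sketch checks element names only.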

wildcat: something that is spontaneous and in violation of discipline. An example would be an illegal mine, created by miners secretly digging a side tunnel off the main pit. Wildcat mines are a major problem in China, and are involved in hundreds of deaths each year. Other examples include "wildcat strikes," which are not authorized by the trade union, but may be merely a mutiny by a group of workers in one location. Here, obviously, wildcat HTML is just a form of HTML that is not recognized as an HTML standard and that creates an effect not designed into the browser.

forking: the creation of a new, modified version of a piece of software by a group other than the original publishers. For example, if someone takes GNU Emacs, alters it to recognize more character sets, and then launches it as (say) "Water Buffalo Emacs," that would be a fork. Thereafter, WB Emacs might compete for developer time and energy with GNU Emacs. See hacking.

SOURCES & ADDITIONAL READING: Wikipedia, XML; W3C Tutorials, XML; Jan Egil Refsnes, XML files

BOOK: Elliotte Rusty Harold, XML 1.1 Bible, Wiley Publishing (2004)
