XML: The Web Is Only the Beginning

by Stan Gunn

In its earliest days, many people considered the World Wide Web a fad. It was referred to more than once as "CB radio for the '90s." Few make that mistake anymore. The Web has revolutionized the way information is delivered. For a growing percentage of the population it has become a primary source of information. Magazines, newspapers, dictionaries, encyclopedias and other resources too numerous to name have made their way onto this new medium. Libraries, from the largest research universities to the smallest towns, have rushed to provide access to their patrons.

But the rush to move information to the Web has been hampered by its humble beginnings. By now, most people have at least heard of the Hypertext Markup Language or HTML. Developed by a researcher named Tim Berners-Lee in 1991, HTML was originally written as a method for making academic papers more accessible. When Berners-Lee developed the first few tags that lie at the core of HTML, he had no idea he was creating a language that would be learned by millions in the coming decade.

For a variety of reasons, HTML came about at exactly the right time. The Web was the perfect application for the Internet, a twenty-year-old but still nascent network shared among universities and other research institutions. The Internet spread the innovations of the Web to a wide audience, and the Web caused the popularity of the Internet to explode.

Unfortunately, the Web, for all its popularity, still labors under the burden of a language that was written for a specific type of document. HTML has expanded since those early days, but mainly as a result of vendors such as Netscape and Microsoft throwing in tags that would distinguish them from each other.

Something better, something more expandable, would be needed for the World Wide Web and the services available through it to continue to grow. For this reason, the World Wide Web Consortium introduced a new standard in 1997. The Extensible Markup Language, or XML, provides a new framework on which to build a larger and richer means of transmitting information than the Web can currently provide.

XML is actually a smaller, easier-to-implement subset of the Standard Generalized Markup Language (SGML), which has been used for years for technical documentation by the U.S. government and many industries. To bring this story full circle, Tim Berners-Lee used his knowledge of SGML as the core of his work on the Hypertext Markup Language.

XML is not a replacement for, nor an addition to, HTML. XML is not a set of tags. Instead, it represents a language that can be used to define languages such as HTML. In other words, it is a meta-language that provides a framework on which innumerable new structures can be built. These frameworks, called Document Type Definitions (DTDs), let users define a set of ground rules for their own tag sets and markup languages.

In essence, XML defines a logical and consistent way in which markup languages may be implemented. XML allows users to create a grammar under which a new language can be developed. Authors using XML define their own tags with their own meanings and rules for use. The DTD consists of a concise description that browsers and other software can use to interpret an encoded document.

Any group that has a common need to share information can do it via their own XML DTD. Insurance companies can use it as a framework for sharing medical records. News organizations can use it as a standard format for filing stories. Banks can develop a format for passing financial information.
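As a rough illustration, a DTD for filed news stories might look like the sketch below. The element names (STORY, HEADLINE, BYLINE, DATELINE, BODY) and the sample story are invented for this example and do not come from any real news-industry standard. The DTD is embedded at the top of the document itself. Python's standard-library parser is used here only to read the document; it does not validate against the DTD, so the order check is a hand-written stand-in for what a validating parser would do.

```python
import xml.etree.ElementTree as ET

# A hypothetical story document with its DTD embedded in the DOCTYPE.
# The DTD says: a STORY holds exactly these four elements, in this order,
# and each of them holds plain text (#PCDATA).
STORY = """<!DOCTYPE STORY [
  <!ELEMENT STORY (HEADLINE, BYLINE, DATELINE, BODY)>
  <!ELEMENT HEADLINE (#PCDATA)>
  <!ELEMENT BYLINE (#PCDATA)>
  <!ELEMENT DATELINE (#PCDATA)>
  <!ELEMENT BODY (#PCDATA)>
]>
<STORY>
  <HEADLINE>Library Adds Public Web Terminals</HEADLINE>
  <BYLINE>Staff Writer</BYLINE>
  <DATELINE>Littleton, TX</DATELINE>
  <BODY>Patrons can now reach the Web from the reading room.</BODY>
</STORY>
"""

# The child order that the content model above requires.
REQUIRED_ORDER = ["HEADLINE", "BYLINE", "DATELINE", "BODY"]


def follows_story_model(xml_text):
    """Return True if the document's root is STORY and its children
    appear in the order the (hypothetical) DTD declares."""
    root = ET.fromstring(xml_text)
    child_tags = [child.tag for child in root]
    return root.tag == "STORY" and child_tags == REQUIRED_ORDER


if __name__ == "__main__":
    print(follows_story_model(STORY))
```

Any organization that agreed on this DTD could exchange stories and check them mechanically before accepting them.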

By defining their own tags and structure, organizations can make their documents better reflect their data. For example, a customer record in HTML might look like this:

<H1>John J. Smith</H1>

<H2>8/17/1966</H2>

<P>2132 Mulberry Drive

Littleton, TX 75733</P>

Whereas in an XML document the same record might read:

<CUSTOMER>

<NAME>John J. Smith</NAME>

<BIRTHDATE>8/17/1966</BIRTHDATE>

<ADDRESS>2132 Mulberry Drive</ADDRESS>

<CITY>Littleton</CITY>, <STATE>TX</STATE>

<ZIP>75733</ZIP>

</CUSTOMER>

Why are such customized tags important? Because they lend intelligence to the document. Searching for specific information in HTML documents is difficult. A computer can look at an XML document and extract many types of useful information based on the demands of the user. In a sense, an XML file is much like a record in a database, but in this case the file contains not just the data, but definitions of the fields as well.
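To make the point concrete, here is a minimal sketch of how a program might pull individual fields out of the customer record. It assumes the record is wrapped in a single root element, called CUSTOMER here, because a well-formed XML document needs exactly one root. The function names are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# The customer record from the article, wrapped in an assumed CUSTOMER root.
RECORD = """<CUSTOMER>
<NAME>John J. Smith</NAME>
<BIRTHDATE>8/17/1966</BIRTHDATE>
<ADDRESS>2132 Mulberry Drive</ADDRESS>
<CITY>Littleton</CITY>, <STATE>TX</STATE>
<ZIP>75733</ZIP>
</CUSTOMER>"""


def get_field(xml_text, field_name):
    """Return the text of the first child element named field_name,
    or None if the record has no such field."""
    root = ET.fromstring(xml_text)
    element = root.find(field_name)
    return element.text if element is not None else None


def record_as_dict(xml_text):
    """Turn the record into a field-name -> value mapping,
    much like a row in a database."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


if __name__ == "__main__":
    print(get_field(RECORD, "ZIP"))
    print(record_as_dict(RECORD))
```

The HTML version of the record offers no comparable handle: a program would have to guess that the second heading is a birth date and that the last five digits of the paragraph are a ZIP code.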

Just as importantly, any document encoded under XML (or HTML) rules is human-readable. Tags are written in plain text. Even without a computer acting as interpreter, a person can read an XML-encoded file. This is, perhaps, one of the greatest strengths of XML. Today's computer software stores files in a bewildering array of competing and proprietary formats. As a simple example, a Microsoft Word file will never translate exactly into WordPerfect. XML promises a change to that. Next-generation browsers will parse XML with the same ease Netscape and Internet Explorer read HTML today. Any document, created under any XML-compliant format, will be accessible.

In addition, XML documents don't rely on specific formatting to make text appear in a certain font or at a certain size. Tags developed under XML aren't used to generate tables or change the color of the background. While this may seem a limitation at first, it's actually a powerful benefit. The appearance of a file encoded with XML will be handled in a separate layer called a style sheet. Style sheets can be targeted to specific media and can be selected based on the device to which they are being sent. This means that, without changing the fundamental structure, an XML document could be engineered to be just as legible on a computer screen, the display of a cell phone, or the control panel of a household appliance.
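The separation of content from appearance can be sketched in a few lines. In the example below, the same customer record is rendered two ways without being changed: as HTML for a computer screen and as short plain-text lines for a small display. The tag-to-element mapping, the function names, and the CUSTOMER root element are all assumptions made for this sketch. A real deployment would use a style sheet language such as CSS or XSL rather than a Python dictionary.

```python
import xml.etree.ElementTree as ET
from html import escape

# The customer record, wrapped in an assumed CUSTOMER root element.
RECORD = """<CUSTOMER>
<NAME>John J. Smith</NAME>
<BIRTHDATE>8/17/1966</BIRTHDATE>
<ADDRESS>2132 Mulberry Drive</ADDRESS>
<CITY>Littleton</CITY>, <STATE>TX</STATE>
<ZIP>75733</ZIP>
</CUSTOMER>"""

# Stand-in for a screen style sheet: which HTML element displays each field.
SCREEN_STYLE = {
    "NAME": "h1",
    "BIRTHDATE": "h2",
    "ADDRESS": "p",
    "CITY": "span",
    "STATE": "span",
    "ZIP": "span",
}


def render_for_screen(xml_text):
    """Render each field as the HTML element chosen by SCREEN_STYLE."""
    root = ET.fromstring(xml_text)
    parts = []
    for child in root:
        html_tag = SCREEN_STYLE.get(child.tag, "p")  # default to a paragraph
        parts.append(f"<{html_tag}>{escape(child.text or '')}</{html_tag}>")
    return "\n".join(parts)


def render_for_small_display(xml_text, width=20):
    """Render each field as one plain-text line, cut to the display width."""
    root = ET.fromstring(xml_text)
    return "\n".join((child.text or "")[:width] for child in root)


if __name__ == "__main__":
    print(render_for_screen(RECORD))
    print(render_for_small_display(RECORD, width=12))
```

Changing how the record looks on either device means editing only the rendering layer; the record itself stays untouched.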

Already, projects that rely on the open sharing of information between diverse groups have adopted XML. Projects are underway to encode court records, notation for mathematical papers, and research in the human genome.

The most notable use to date in the library field might be the Encoded Archival Description language, or EAD. This format began its life in 1993 as a proposed SGML standard, but today is fully compatible with XML. The stated purpose of EAD is to standardize the creation of finding aids such as inventories, registers, indexes, and other documents created by archives, libraries, museums, and manuscript repositories. The project began at the University of California, Berkeley, but was taken up by the Society of American Archivists in 1995. The first official version (1.0) of EAD was released in August 1998. EAD is already in widespread use at institutions such as the Library of Congress, Duke University, Johns Hopkins Medical Institutions, the New York Public Library, and many others.

Despite all of these projects and advances, the road to XML still has a few obstacles. Even the latest generation of browsers has only limited support for XML. Users who want to access XML files must download viewers or rely on web sites that translate XML documents to HTML format.

The complexity of XML also far surpasses that of HTML. One of the reasons for the explosive growth of the Web is the simplicity of creating web pages. A bright grade-schooler with a word processor can write HTML. XML requires much more sophisticated tools to develop and a solid understanding to implement correctly.

Finally, it is quite possible that the ease with which organizations can create their own XML standards will lead to a Balkanization of formats. If every bank develops its own specialized tags and structure for financial transactions, the universality of the underlying system could be compromised.

If XML catches on, will the Web wither and die? Probably not, at least not any time soon. The relative simplicity of HTML and the pervasive popularity of the Web will keep the Internet humming for many years to come. And besides, the latest version of HTML was written as an application of XML. Whether it replaces the Web or not, XML promises to deliver something that will benefit society for decades to come: an open, universally accepted format for storing and transmitting information.

A few web sites to get you started:

Official site for information about the XML standard:

http://www.w3.org/XML/

One of many tools available for XML creation:

http://www.xmetal.com/

Information about EAD and its implementers:

http://www.loc.gov/ead/

Stan Gunn is director of network systems at the Texas Youth Commission and a doctoral student in Archival Enterprise at the Graduate School of Library and Information Science at The University of Texas at Austin.
