Nature debates, 1 October 1998

The XML Revolution


If you have ever peeked with the 'view source' option on your Web browser, then you're familiar with Hypertext Markup Language (HTML).

HTML was an overwhelming success because it fulfilled a dream that word processors, despite their myriad features, do not [1]. So,

"pick up your pen, mouse or favorite pointing device and press it on a reference in this document -- perhaps to the author's name, or organization, or some related work. Suppose you are directly presented with the background material -- other papers, the author's coordinates, the organization's address and its entire telephone directory. Suppose each of these documents has the same property of being linked to other original documents all over the world. You would have at your fingertips all you need to know about electronic publishing, high-energy physics or for that matter Asian culture. If you are reading this article on paper, you can only dream, but read on."

Now that dream is a reality, and human communication is vastly augmented by the Web: that is, as long as the communication consists of a title, headings, paragraphs, lists, tables and forms.

What about all the other communications idioms and document types that we routinely use to get our work, business, and play done?

  • Restaurant menus
  • Theatre programmes
  • Meeting minutes, with agenda items and actions
  • Cheques, invoices and purchase orders
  • Calendars and project schedules

Extensible Markup Language (XML) is the evolutionary successor to HTML, in that "less is more". If you're thinking that XML is all the stuff from HTML plus a few more things, think again. It's the same pointy-brackets, tags, and attributes; but when it comes to tag names, the slate is wiped clean. XML is like HTML with the training wheels off.
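To make "the slate is wiped clean" concrete, here is a minimal sketch of a restaurant menu marked up with tag names of its own. The vocabulary (menu, dish, name, price) and the restaurant are invented for illustration and come from no standard; the parsing uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical menu vocabulary: <menu>, <dish>, <name> and <price> are
# made up for this sketch. XML itself does not define any of them.
MENU_XML = """
<menu restaurant="Chez Example">
  <dish course="starter">
    <name>Onion soup</name>
    <price currency="USD">6.50</price>
  </dish>
  <dish course="main">
    <name>Roast chicken</name>
    <price currency="USD">14.00</price>
  </dish>
</menu>
"""


def dishes_under(menu_xml, limit):
    """Return the names of dishes priced below `limit`."""
    root = ET.fromstring(menu_xml)
    return [
        dish.findtext("name")
        for dish in root.iter("dish")
        if float(dish.findtext("price")) < limit
    ]


print(dishes_under(MENU_XML, 10.0))
```

Because the markup says which text is a dish name and which is a price, a program can answer a question like "what costs less than ten dollars?" without guessing from layout.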

Of course, you can imitate menus, programs and schedules with HTML, or you can put pictures or facsimiles of their traditional printed form on the Web. That's great because it allows you to share them with people all over the planet instantly. But it doesn't invite the computer to help you manage them.

The bane of my existence is doing things that I know the computer could do for me.

If the Web page with your personal calendar says you'll be in New York next Thursday, and the page with your workgroup calendar says you'll be in London all week, shouldn't the computer be able to warn you about the conflict? And shouldn't it go ahead and ask you if it's OK to cancel your flight to London and purchase this other ticket to New York?
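Here is a rough sketch of how that warning might work if both calendars were published as XML rather than as pages meant only for human eyes. The calendar and trip elements, their attributes, and the dates are all assumptions made up for this example; no real calendar format is implied.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical calendar vocabulary and made-up dates, for illustration only.
PERSONAL = """
<calendar owner="me">
  <trip city="New York" start="1998-10-08" end="1998-10-08"/>
</calendar>
"""

WORKGROUP = """
<calendar owner="workgroup">
  <trip city="London" start="1998-10-05" end="1998-10-09"/>
</calendar>
"""


def trips(calendar_xml):
    """Yield (city, start, end) for every <trip> in a calendar document."""
    for trip in ET.fromstring(calendar_xml).iter("trip"):
        yield (
            trip.get("city"),
            date.fromisoformat(trip.get("start")),
            date.fromisoformat(trip.get("end")),
        )


def conflicts(a_xml, b_xml):
    """Return (city_a, city_b, first_clashing_day) for overlapping trips
    that place the same person in two different cities."""
    found = []
    for city_a, start_a, end_a in trips(a_xml):
        for city_b, start_b, end_b in trips(b_xml):
            overlap = start_a <= end_b and start_b <= end_a
            if overlap and city_a != city_b:
                found.append((city_a, city_b, max(start_a, start_b)))
    return found


for city_a, city_b, day in conflicts(PERSONAL, WORKGROUP):
    print(f"Conflict on {day}: {city_a} vs {city_b}")
```

Detecting the clash is the easy part once the dates and cities are labeled; cancelling one flight and booking another would need the airline's systems to speak a shared vocabulary too.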

As a medium for human communication, the Web has reached critical mass (I won't go so far as to say it's mature--there's plenty of work still to be done) but as a mechanism to exploit the power of computing in our everyday life, the Web is in its infancy. The Web now allows us to communicate our problems to one another faster than ever before, but does it really help us solve them?

XML is so simple that it just might work: it might revolutionize the ability of people to conduct commerce, express themselves, and generally get work done with computers and networks.

Website designers are doing some amazing things, but they often reinvent the wheel for any number of reasons. Order-processing systems make a good example: one web design shop builds a shopping-cart system, but a second shop can't use it, because

  • the second shop's infrastructure is Windows NT, and the system is based on Unix, or
  • the two shops work in different languages, Perl vs Java, or perhaps
  • the folks at the second shop were just too busy to discover that the first had solved the problem, or
  • the system is aimed at a million transactions per day and requires thousands of dollars' worth of hardware and software, while the folks only expect a few orders a week and can only afford a few hundred dollars, or
  • the first shop doesn't care to share its technology with the community, either because
    • it doesn't want to lose a competitive advantage, or
    • it doesn't want to take on a support burden.

For all these reasons, it takes longer to develop effective websites than it should, and the community is looking for opportunities to share technologies and resources.

At the lowest level, organizations like the World Wide Web Consortium (W3C), the Internet Engineering Task Force (IETF) and the Object Management Group (OMG) are engaged in updating the transport infrastructure, HTTP. The aims are first to address some of the design shortcomings that five years of experience have exposed, and second to integrate better with modern software development. At the next level, the software development community is pushing the Web down into the infrastructure of operating systems and languages like Perl, Java, and Microsoft Windows. The goal of all this low-level stuff is that it "just works," like a light switch or a telephone.

But there's a twist: along with shipping your pages around, the computing infrastructure should take every opportunity to read, understand, and act on them. There's no reason to live with the status quo [2]:

"Hospitals have begun to offer the [home health care] agencies a solution that goes something like this:
  1. Log into the hospital's Web site.
  2. Become an authorized user.
  3. Access the patient's medical records using a Web browser.
  4. Print out the records from the browser.
  5. Manually key in the data from the printouts.

The knowledgeable reader may smile at this "solution," but in fact this is not a joke; this is an actual proposal from a large American hospital known for its early adoption of advanced medical information systems."

Manually key in the data? Can't the two systems be made to talk to each other? Never mind the multibillion-dollar medical industry; how often do you get a computer-generated bill, invoice, or airline ticket, and then manually key the information into your computer to manage your schedule or finances? Is this the best we can do? Not if the XML revolution succeeds.

Today, several major Web search services build big indexes. These are incredibly useful, but they're also limited: they don't know the difference between a book by Ben Franklin and a book about Ben Franklin, let alone the difference between an African beetle and a Volkswagen Beetle.

The search services do know which part of your page is the title, because the <title> tag in the HTML markup tells them. Why not just add <by> and <about> and <genus> and such tags to HTML? Because

  • technically, it would produce a mess: HTML is hard enough to process now, and if we make it harder, we reduce the chance that new tools will come along and make the Web smarter.
  • socially, it wouldn't work: the HTML specification is maintained by a small group of experts who are trusted to Do The Right Thing on behalf of the community; that small group doesn't have expertise in all subjects that may be covered by Web pages, and if we added that expertise to the group, it would be too large to function. It is much better to give everyone a tool that they can easily adapt for their own particular needs.

HTML was a critical first step, but it is, by design, a one-size-fits-all solution; it works well when applied to its original domain of simple structured documents with links, but doesn't work so well in all the other domains where people want the Web to apply.
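The "by" versus "about" distinction from the search example can be sketched with a small catalog in a made-up XML vocabulary. The catalog, book, by and about tags are assumptions for illustration, not part of HTML or of any published standard, and the third entry is a placeholder rather than a real book.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog vocabulary; the last book is an invented placeholder.
CATALOG = """
<catalog>
  <book>
    <title>The Autobiography of Benjamin Franklin</title>
    <by>Benjamin Franklin</by>
    <about>Benjamin Franklin</about>
  </book>
  <book>
    <title>Poor Richard's Almanack</title>
    <by>Benjamin Franklin</by>
  </book>
  <book>
    <title>A Hypothetical Franklin Biography</title>
    <by>A. N. Author</by>
    <about>Benjamin Franklin</about>
  </book>
</catalog>
"""


def titles_where(catalog_xml, tag, person):
    """Return titles of books whose <tag> element names `person`."""
    root = ET.fromstring(catalog_xml)
    return [
        book.findtext("title")
        for book in root.iter("book")
        if any(el.text == person for el in book.findall(tag))
    ]


print(titles_where(CATALOG, "by", "Benjamin Franklin"))
print(titles_where(CATALOG, "about", "Benjamin Franklin"))
```

A plain keyword index would return all three books for either question; with the roles marked up, the two queries give different answers.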

XML, like the Internet and the Web, is designed to facilitate a marketplace of competing companies, innovative individuals, and organizations of all sizes in between. W3C is a consortium of 270+ member organizations committed to the growth of this marketplace, ensuring interoperability and smooth evolution.

This decentralized marketplace is already at work: to automate the exchange of bills, statements, and payments, the banking and software heavyweights are working on Open Financial Exchange (OFX); meanwhile, to automate exchange of information about chemicals, their properties, uses and suppliers, one researcher in Nottingham, Peter Murray-Rust, rolled up his sleeves, and Chemical Markup Language (CML) was born.

XML is intended to span this wide spectrum of application, and it has become a strategic technology in W3C, where members are sharing resources to complement HTML with XML-based technologies:

  • MathML, for describing mathematics as a basis for machine-to-machine communication.
  • SMIL, for expressing media synchronization.
  • RDF, for resource description, such as library-style cataloging.
  • P3P, to use XML and RDF so users can be informed, in control, and make decisions based on their individual privacy preferences.

XML by itself is just a simple text format; but together with all the ways it's being used to share structured information, it's a revolution that promises to make the Web a whole lot smarter.

References

1. Berners-Lee, T., et al. World-Wide Web: The Information Universe. Electronic Networking: Research, Applications and Policy 1(2) (Meckler, Westport CT, USA, 1992).

2. Bosak, J. XML, Java, and the future of the Web. Sun Microsystems (c. 1 Oct 98) (1997).

Dan Connolly is the leader of the W3C Architecture Domain. He collaborated with Jon Bosak to form the W3C XML Working Group and produce the W3C XML 1.0 Recommendation.


Nature © Macmillan Publishers Ltd 1998 Registered No. 785998 England.