Sure, anyone with Microsoft FrontPage can put a human resources policy manual on the Web, but creating production web sites with existing web technology is simply too much work. Three broad problem areas in current web technology make it hard to build these new applications: content management, application development, and integration with other systems.
Let's look at these problems in detail.
Web servers are great for making information available to a wide audience. Unfortunately, they do very little to help web site developers manage all this information. An ideal platform would help us develop sites that make it easy for users to find what they are looking for, are easy to keep up to date, and allow easy tracking of site content changes.
There's universal agreement that good web sites make it easy to find what you need. Unfortunately, the filesystem architecture of most web servers makes it difficult to put searches into specific, meaningful contexts.
Filesystems manage files at the operating system level. To make it easier for users to find their files, the system automatically keeps various attributes for each file, such as its name, size, creation date, and owner. When we create a spreadsheet, for instance, the system saves it and its attributes within the file structure. Later, if we forget the particular name of the file, we can search for it based on its attributes. For example, in DOS we can enter a command like dir *.xls /s to list every spreadsheet file on the drive.
This works great when you are sitting at a command line looking for a file that you created. The model breaks down, however, when you attempt to extend it to the Web. When people are searching for files on the Web, they don't care about the file's name or size (unless they're using a 14.4 modem!). Since they care about the file's contents, not its properties, the attributes maintained by the filesystem are largely irrelevant.
Some sites use search engines to overcome this shortfall. When a user enters a search term, the engine churns through an index of all the documents and returns a list of links to the files containing the search terms. Some of the most popular sites on the Web, like Yahoo! or AltaVista, attempt to do this for all files on the Internet.
You only have to look for the term "sexual reproduction" on a web search engine to see how laughable this effort really is. While keyword searches can be helpful, they almost always fail to put the search into a meaningful context. For example, suppose I want a list of all works by Harper Lee. I should be able to enter something like "Give me a list of all works where Harper Lee is the author." With a keyword search, however, in addition to her only book, To Kill a Mockingbird, I'm likely to get dozens or hundreds of additional documents ranging from a brochure about Harper's Ferry, West Virginia, to a retrospective of Bruce Lee movies.
The simple fact of the matter is that effective searches on the Web require a broader, more flexible set of attributes than filesystems maintain. In addition to simply describing a file, a web server should automatically keep meaningful metadata [1] about the file's contents that puts a search into a specific context. This metadata should extend to all files, regardless of format. What if a document isn't ASCII at all or doesn't even represent a spoken language? For executable binaries, for example, it would be nice to be able to directly assign searchable attributes like "purpose" or "platform."
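The idea can be sketched with a small metadata store. The schema and entries below are hypothetical illustrations, not any particular product's design; the point is that an attribute query puts the search into a meaningful context where a keyword search cannot.

```python
import sqlite3

# Hypothetical metadata store: every file, regardless of format,
# carries searchable attributes beyond what a filesystem keeps.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE documents (
        path     TEXT,
        author   TEXT,
        title    TEXT,
        purpose  TEXT,
        platform TEXT
    )
""")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?, ?)",
    [
        ("/books/mockingbird.html", "Harper Lee", "To Kill a Mockingbird",
         "literature", None),
        ("/tourism/harpers_ferry.html", "WV Tourism Board",
         "Visit Harper's Ferry", "brochure", None),
        # Even a binary can carry "purpose" and "platform" attributes.
        ("/bin/report.exe", None, "Sales Report", "reporting", "win32"),
    ],
)

# A search in context: "all works where Harper Lee is the author."
rows = conn.execute(
    "SELECT title FROM documents WHERE author = ?", ("Harper Lee",)
).fetchall()
titles = [title for (title,) in rows]
print(titles)  # only the Lee title matches, no Harper's Ferry brochure
```

A keyword index over the same three files would have matched "Harper" in two of them; the attribute query matches exactly one.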
A second consequence of current web servers' lack of integrated content management is that it is too hard to keep a complex site up to date. Hyperlinks on the Web act as a mapping from a logical name, like "Andrew's Homepage," to a literal file that resides on a specific machine, like C:\andrew\web_stuff\default.htm. These links, created through URLs, let us navigate from one page to another. The problem with this dual mapping system is that we have to make every update in two places: in the filesystem and in the URL. If someone deletes or moves a file but forgets to change the corresponding URL on every page on which it appears, we are guaranteed to have broken links. It's probably impossible to manage this process manually on a large site.
This two-step process also creates extra work for the webmaster. Publishing a new document, for example, requires the webmaster to manipulate various files by hand: she must use FTP to copy the new document to the web server and must then edit an existing document (such as the home page) to add a link to the new file. While this process is fine for dozens, maybe even hundreds, of individual documents, it is unrealistic to expect to keep a site completely up to date when there are thousands, or even millions, of individual documents. Consequently, sites contain inaccurate information, broken links, and pages perennially under construction.
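The fix implied here is a single point of update: pages refer to logical names, and only one registry maps those names to physical files. A minimal sketch, with hypothetical names and paths:

```python
# Hypothetical link registry: the only place that maps a logical
# name to a physical file. Pages embed logical names, never paths.
links = {
    "andrews-homepage": "C:/andrew/web_stuff/default.htm",
    "hr-manual":        "C:/hr/policies/manual.htm",
}

def url_for(logical_name):
    """Resolve a logical link name to its current physical file."""
    return links[logical_name]

# Moving a file means editing one registry entry, not hunting down
# every page that links to it.
links["andrews-homepage"] = "C:/andrew/public/index.htm"
print(url_for("andrews-homepage"))
```

With this indirection, a moved or renamed file can never leave stale URLs scattered across the site, because the pages themselves hold no paths.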
Finally, current web servers don't have a way to automatically track all changes to a document. While some operating systems, like VMS, have automatic versioning systems web developers can exploit, most do not. Since the ability to audit changes is a fundamental requirement for any production information system, an ideal web system would handle it automatically.
Suppose a webmaster or an end user updates a file, and it turns out later that he or she made a mistake. How do we track down exactly what was changed and fix it? Most filesystems don't automatically maintain logs that let us reconstruct a complex sequence of changes. Instead, we must either rely on the webmaster's memory or reconstruct the sequence of events from backups. Filesystems are simply not designed to handle complex audit tracking.
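What automatic change tracking buys us can be shown in a few lines. This is a toy sketch with made-up names, not a real versioning system: every update appends to a log, so a bad edit can be found and rolled back without relying on the webmaster's memory or on backups.

```python
import datetime

# Hypothetical audit trail: every change is logged with the prior
# content, the author, and a timestamp.
audit_log = []
documents = {"policy.htm": "Original policy text"}

def update(path, new_content, who):
    audit_log.append({
        "path": path,
        "old":  documents.get(path),
        "who":  who,
        "when": datetime.datetime.now(datetime.timezone.utc),
    })
    documents[path] = new_content

def rollback(path):
    # Walk the log backward to find the last change and restore it.
    for entry in reversed(audit_log):
        if entry["path"] == path:
            documents[path] = entry["old"]
            return entry
    return None

update("policy.htm", "Mistaken edit", "webmaster")
rollback("policy.htm")
print(documents["policy.htm"])  # back to the original text
```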
The Internet has also blurred the traditional line between applications and data to the point where it's unclear how to classify many sites. While a static HTML document is "content" and a Java applet is a "program," how do we classify hybrid systems that are a little bit of each? For example, a data warehouse might have a web interface that seems like a normal web site, but behind the scenes each page is generated dynamically by running a database query. Is this really a web site as we normally think of it, or is it closer to an application acting on underlying data? Although there is no clear agreement, the term content-driven web site, implying equal parts of data and application, is one of the best names for these sorts of sites.
In web parlance, the applications and programs that create content-driven web sites are called dynamic resources . Dynamic resources are unlike documents created with an HTML editor such as Microsoft FrontPage, although both types of documents are accessed over the Web using a URL, and both return an HTML document. A dynamic resource is a program that creates a page upon a user's request, not a static file that exists beforehand. While such a program traditionally generates HTML, it can create any type of content; for example, you could write a system to create a graph in GIF or JPEG format, using sales data stored in a database table.
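The distinction is easiest to see in code. Below is a minimal sketch of a dynamic resource: rather than serving a file that exists beforehand, the server runs a function that builds the page at request time. The sales figures and function names are made-up illustrations.

```python
import datetime

# Made-up data standing in for a database table.
sales = {"Q1": 120, "Q2": 95, "Q3": 140}

def render_sales_page():
    """Build an HTML page on demand -- no static file exists."""
    rows = "".join(
        f"<tr><td>{quarter}</td><td>{amount}</td></tr>"
        for quarter, amount in sales.items()
    )
    return (
        "<html><body>"
        f"<h1>Sales as of {datetime.date.today()}</h1>"
        f"<table>{rows}</table>"
        "</body></html>"
    )

# Each request gets a freshly generated page reflecting current data.
html = render_sales_page()
```

The same request-time approach could just as easily emit a GIF or JPEG graph instead of HTML; the output format is whatever the program chooses to generate.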
As technology has progressed, it has become possible to create more and more complex dynamic resources. Once limited to simple operating system scripts, they can now be written in a host of viable languages: Perl, Visual Basic, C, C++, Java -- even COBOL or FORTRAN! In addition, web servers now support more sophisticated invocation methods. The list of technologies is growing longer every day: CGI, application servers, cartridges, Java servlets, Object Request Brokers (ORBs), and on and on.
The explosive growth of these different technologies and techniques has made it difficult, if not impossible, to select a single platform that can meet all of your current and future needs. Ironically, the overwhelming number of development options is one of the most unsatisfactory things about web development. How do you know which one to pick? Will that technology exist in five years? Is it a viable commercial product or someone's Ph.D. thesis?
The profusion of options has led to two related problems. First, no single platform can meet the needs of every type of application and user group. Second, developers have to use a variety of platforms, depending on the type of application they are building, which stretches their ability to become proficient with any particular technology.
Current development platforms rarely scale in both directions. For example, suppose you develop a really slick web application for your department using Active Server Pages on Windows NT. Word gets out around the company about how great it is and hundreds of people want to start using it. Suddenly, your application, which was designed for use by 10 or 20 people, has to accommodate hundreds. What can you do to scale it up? Conversely, suppose you need to build a small, specialized system that is to reside on its own server. You know it will never have more than a few users. Will you really use a Sun Ultraserver to build it? No, you'll go with something smaller and more affordable. As developers, it's hard for us to remember that technology decisions should scale in price as well as performance.
Ideally, developers should be proficient on just one development platform that can scale across different hardware platforms, from Intel to Alpha to Sparc, and operating system platforms, from NT to Unix to VMS. Unfortunately, this is not the case with current web server application development. Developers wander from one platform to the next, worrying, like Goldilocks, that "This one's too small" or "This one's too big," when they need one that's just right.
You must also factor in the skill levels required by each option. One of the worst aspects of this fragmentation is that each platform demands its own specific skill set, so you wind up with a development team split along platform lines. For example, you may have one group of programmers that uses Perl, one that uses Java, one that uses Oracle Forms, and one that uses PL/SQL. Since it's impossible for anyone to master every technique on every platform, you wind up with systems that only a small group can support.
As if content management and application development aren't enough of a challenge, the new breed of application must seamlessly interact with internal applications and electronically exchange data with external systems. Data entered by remote users must synchronize with the production systems. Orders placed on your web site must flow into an order entry system, which must then send the customers email notifying them that their orders have been received. Purchase orders must flow from your system into the order entry systems of your business partners.
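The order flow described above can be sketched as a chain of handoffs. Every name here is hypothetical; the point is that a web order moves into order entry and triggers a confirmation email automatically, with no uploads, downloads, or rekeying.

```python
# Hypothetical stand-ins for the production order-entry system
# and the outgoing mail queue.
order_book = []
email_outbox = []

def enter_order(order):
    """Record the order, then queue the customer's confirmation."""
    order_book.append(order)
    email_outbox.append(
        (order["customer"],
         f"We received your order for {order['item']}.")
    )

def place_web_order(customer_email, item):
    """The web site hands the order straight to order entry."""
    enter_order({"customer": customer_email, "item": item})

place_web_order("pat@example.com", "widgets")
print(email_outbox[0][1])
```

In a real deployment each handoff would cross system boundaries (web server to order entry to mail server), which is exactly where current web servers offer so little help.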
These types of tasks are well beyond the scope of almost all the web servers currently available. While it's possible to build this functionality, it is usually a kludgey process performed with uploads or downloads or, God forbid, rekeying the information by hand.
Web server vendors are attempting to address this problem by defining universal standards for interoperability and object-to-object communication; some of the most promising solutions, such as CORBA and COM, are already available. However, the battle over what will be the general standard is already brewing and promises to make the browser wars look like a game of touch football at a retirement home.
Copyright (c) 2000 O'Reilly & Associates. All rights reserved.