OK, now that we've established what we are talking about (in this document) when we say "semantic web", let's hurry along and cover some of the architectural issues that are involved with actually implementing the semantic web.
Centralization?
First, and perhaps most important, of the architectural challenges is that of centralization. Nobody wants to store their data on centralized servers, because then their data would be vulnerable to Byzantine threats (like governments and floods). This is especially true of metadata -- if you put up a web page criticizing a political candidate, it's unlikely that many people will see your page without a serious effort on your part. On the other hand, if you add an annotation directly to the candidate's home page, anyone using an annotation viewer will see it when they visit that candidate's home page. Some candidates could be convinced that blocking such comments would be beneficial to their campaigns. Such concerns mean that metadata which you intend to share with others should probably live on servers that are not too centralized, and probably not on the servers hosting the data you are tagging. (Of course, if you own the web pages, you could embed the metadata in the page.) Metadata storage techniques that are vulnerable to Byzantine threats may not inspire enough user trust to reach a critical mass.
Besides threats to privacy, there is another problem with metadata being centrally stored or indexed. Tools depend on the ability to query metadata (for example, a tool might ask for the top five recommended links associated with a certain page). Having a standard way in which tools can query metadata servers allows for tools to be independent of servers, and encourages tool development and service development. However, if there are only a few central indexing services, there will be a strong incentive for these services to compete by differentiating their services, adding features, and "improving" the query interfaces. It is likely that reducing the number of metadata services would lead to tools that are tightly-coupled to one service or another.
Scaling Metadata
Scalability factors are where the differences between metadata and data become apparent. First, where a web page is published has little to do with your ability to experience that page. As you follow hyperlinks throughout a site and on to new sites, your browser contacts each location as appropriate and gets the file that you want to look at next. You browse one page at a time, more or less. On the other hand, metadata is not something that you browse "one piece of metadata at a time". The truly interesting metadata is useful only in aggregate. Consider the case of a rating system. Suppose that one hundred different users have all rated the www.ibm.com page with a number between one and five and have stored their rating in a standard file format on their own personal web sites. Would you want your web browser to contact every user's site and combine all of the ratings each time you browse to www.ibm.com? Of course not! It is clear that (unlike web pages) metadata is not interesting when grouped by publisher, but is interesting only when grouped by the thing that the metadata describes. But the most natural way to group things based on the thing being described (the web pages) is to just store the metadata with the web page, and we've already decided that is a bad idea. Web page owners have too much incentive to remove metadata that they do not agree with.
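The regrouping described above can be sketched in a few lines of Python. This is only an illustration: the publisher names, the pages, and the `regroup_by_subject` helper are all made up for the example, and a real system would have to fetch each user's ratings file over the network first (which is exactly the expensive part).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ratings as each user might publish them on a personal site:
# publisher -> list of (page being rated, rating from 1 to 5)
ratings_by_publisher = {
    "alice.example": [("www.ibm.com", 4), ("www.example.org", 2)],
    "bob.example": [("www.ibm.com", 5)],
    "carol.example": [("www.ibm.com", 3), ("www.example.org", 4)],
}


def regroup_by_subject(by_publisher):
    """Invert publisher -> ratings into page -> list of ratings."""
    by_subject = defaultdict(list)
    for entries in by_publisher.values():
        for page, rating in entries:
            by_subject[page].append(rating)
    return dict(by_subject)


ratings_by_subject = regroup_by_subject(ratings_by_publisher)
ibm_average = mean(ratings_by_subject["www.ibm.com"])
print(ibm_average)
```

Once the ratings are keyed by the page they describe, answering "what do people think of www.ibm.com?" is a single lookup instead of one request per publisher.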
Decentralize Data, Centralize Metadata
In fact, metadata is intended to be indexed, searched, and processed by tools, while web pages are designed for humans to view one page at a time. The best way to index something to allow fast searching and processing is to centralize it. If you consider Napster, you can see that the metadata is centralized for fast querying, and the data is decentralized for fast retrieval. This design rule, "decentralized data, centralized metadata", can be seen on the web today. At the same time as HTML pages are being decentralized further and further from the core through services like Akamai and Inktomi edge caching, searching functions on the Internet are being centralized into ever more powerful engines like Google. Note that it is possible to have the actual metadata stored in a decentralized way, but the indexing and querying of that metadata will be most efficient if centralized.
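As a rough sketch of the "decentralized data, centralized metadata" rule, here is a toy central index in Python. It is not how Napster or any search engine actually works; the `CentralIndex` class, its methods, and the host names are assumptions made up for illustration. The point is only that the index holds pointers to where the data lives, never the data itself.

```python
class CentralIndex:
    """A toy central index: it stores only pointers, never the data itself."""

    def __init__(self):
        self._locations = {}  # keyword -> set of hosts that hold matching data

    def register(self, host, keywords):
        """A host announces which keywords describe the data it stores."""
        for keyword in keywords:
            self._locations.setdefault(keyword.lower(), set()).add(host)

    def lookup(self, keyword):
        """Return the hosts to contact; fetching the data happens elsewhere."""
        return sorted(self._locations.get(keyword.lower(), set()))


index = CentralIndex()
index.register("peer-a.example", ["dolphins", "whales"])
index.register("peer-b.example", ["dolphins"])
hosts = index.lookup("Dolphins")
print(hosts)
```

The query is fast because it touches one place, and the retrieval is fast because it can go to whichever host is closest.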
What it Means
Users will demand systems that perform well, and as we push the limits of what we do with metadata (multiple levels of inference, for example), the performance strain of querying metadata will grow steadily greater. Because the performance requirements will increase as user expectations of features increase, people will tend to be drawn more to metadata services that store metadata centrally. At the same time, the net's libertarian ethic will mistrust anything that is too centralized. The social desire for decentralization will probably be no match for the consumer demand for features and performance, however, so it is likely that most collaborative metadata functions will end up collecting in a few centralized services. Developers of shared metadata services will need to design with the expectation that the inexorable forces of nature will tend toward centralized metadata, and build in safeguards to make sure that their systems cannot become too centralized (and thus become exposed to Byzantine threats, jeopardize the critical mass of users willing to use the service, etc.). This fundamental characteristic of shared metadata to flow to the center is also an opportunity for the software politicians. As market forces drive metadata into centralized repositories, these centralized services will become wonderful targets for conspiracy theories and other such sensationalized portrayals that help the software politicians maintain flock size.
Metadata Explosion
The final scalability challenge threatening to cloud metadata's bright future is the probability that the amount of metadata being generated will be far greater than the data being tagged, and the number of queries made against metadata could exceed the number of queries made to request the related web pages. You can easily imagine the combined size of annotations and comments about the www.microsoft.com home page being much larger than the size of the web page itself. Not that this would be a problem, but remember -- we are talking about centralizing this stuff at least a little. You might take hope from the fact that Google actually keeps complete local copies of every page that it indexes, so in a sense Google has already centralized the entire web onto its storage arrays. But there are two things that suggest this is not the most instructive example. First, storing a page is not the same as storing dynamically changing metadata about that page (which may exceed the size of the page) without having that metadata go stale. Second, buying enough permanent storage to cache the entire web is the sort of expense that would guarantee that only a few centralized players would be in the market, and this overcentralization would tend to defeat the idea of "the semantic web".
It's About Communicating Your Opinions
It seems like it will be pretty challenging to scale this "semantic web" thing. But before we give up, let's step back and think about an analogy that holds especially true for metadata. Consider the possibility that all of this metadata stuff is just a way to say "communicating your opinions about web pages". Think of the examples given in the "Reinvigorating the Web" section. Even when you categorize an article as being about "Sea Mammals", that can be thought of as your (wise) opinion. Other people might categorize the same article in the "Medicine" category because they believe that dolphins possess healing properties that can be harnessed by grinding up dried dolphin fins (it's a stretch, but you get the point I hope). The example earlier of a tool which queries for company information based on the owner of the site you are viewing (in the example, information on CNET when browsing news.com) is equally "opinion". If the tool uses a fairly respected source to determine that news.com is owned by CNET, then you can consider that opinion to be quite authoritative, and not really an opinion, but it's more useful to think of metadata as a bunch of opinions about things, some of which may be more authoritative and others more speculative. That's the way everything in the world is, anyway. Now, you may be thinking, "What about the tool that scans the web page for bands to look them up at RollingStone.com? It is using a grammar parser and natural language processing, so how could that be an opinion?". To that question I reply, prepare to be shocked the first time you are reading a web page about the Virgin Mary and your web browser redirects you to the web page of the singer "Madonna".
Anyway, I promised that you would get some insights about scaling shared metadata by thinking of shared metadata as being shared opinions. So let us start by looking at some ways that people share opinions.
Message Boards (HTTP)
Places like epinions.com, Yahoo Groups, and MSN communities allow people to post their opinions about all sorts of things, from products to politicians. Once you post your opinion, it usually shows up right away. Other people need to go to the site and specifically request your opinion to see it, though. If you have your own web server, you can always write your opinions into an HTML file and save it to the server as well.
Mailing Lists (SMTP)
Some message boards optionally let you have the messages sent to you as e-mail. Mailing lists allow you to e-mail your opinion to a group of people who are interested in (or at least tolerant of) your opinions. Some mailing lists also archive e-mails to the web or allow you to post to the web, but there is an important difference between the web-based message boards and e-mail lists. With an e-mail list, you do not get to see the other person's opinion the instant they send it, but you don't have to go poll a central web site to get the opinion, either. Using e-mail pushes the information closer to you and adds a little bit of latency.
Newsgroups (NNTP)
When you post your opinion to a Usenet newsgroup, people on the same server as you will see the message pretty quickly, but someone on the other side of the globe may have to wait much longer as the NNTP servers pass the opinion along. You get higher latency, but you don't have to have copies of all the opinions in your e-mail inbox. The other important thing about NNTP is that the system is designed to allow multiple distributed write masters to create data, and then the data is replicated around the system. A Majordomo-style mailing list and an HTTP-based message board both involve a central write master. The multi-master approach of NNTP makes it a good choice for sharing opinions when there are many people participating and huge volumes of opinion traffic.
Chat (IRC)
With chat, everyone sees your opinion pretty much instantly. You need to keep the number of people in a chat room small, and messages terse, though, because things would get overwhelming if you were passing along the quantity of information that goes over a large Usenet newsgroup. To get extremely low latency, you give up scalability as well. Large IRC networks will add multiple write servers and synchronize between them, which adds a small amount of latency.
Push, Retention, and Latency
There has been some hype in recent years about "push" technologies. To me, the difference between "push" and "pull" is this: when a system delivers a message to a system that you own, then it is called "push". When a system keeps the message on a machine that it controls, we say that we have to "pull" it down when we want it on our own machine. "Pushing" data to systems that want it (as is the case in the SMTP and IRC examples above) is pretty expensive, so most systems that push data like to delete the data once they are done pushing. For example, how easy is it to go back to an IRC server and see what messages were sent yesterday? On the other hand, web-based message boards and NNTP servers tend to keep copies of the messages around for longer periods of time; that is, they have more liberal retention policies. If we define latency as being "the amount of time between the moment that someone publishes an opinion and the time that the opinion is sitting on a machine that you own", the choice of technique obviously has an impact, although it doesn't seem dependent on push or retention. You can technically keep hitting the "reload" button on your browser for a web-based discussion board and get posts the instant they arrive. You would probably irritate the board owners, and that is certainly not the most scalable use for a web-based board, but it's possible.
Just in case you were wondering what these communication techniques have to do with metadata, consider what it would be like if everyone who posted their opinions was polite enough to use a standard format for their opinions, telling who or what the opinion was about, what type of opinion it was, and then the opinion. As long as people always used this simple format, tools could read through the text of the opinions and do cool things. For example, you could write a tool that reads through the newsgroups for all opinions with the "about" field set to "www.ibm.com" and the "opinion type" set to "rating 1-5", and averages the values to see what people think. Making sense?
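To make the idea concrete, here is a minimal Python sketch of such a tool. The post format is purely an assumption for illustration (two header lines named `About` and `Opinion-Type`, a blank line, then the opinion itself); nothing in this document prescribes it, and the `parse_opinion` and `average_rating` helpers are hypothetical names.

```python
from statistics import mean


def parse_opinion(post):
    """Split a post into header fields and a body, separated by a blank line."""
    header_text, _, body = post.partition("\n\n")
    fields = {}
    for line in header_text.splitlines():
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()
    fields["body"] = body.strip()
    return fields


def average_rating(posts, about, opinion_type="rating 1-5"):
    """Average the numeric opinions that match the given subject and type."""
    values = []
    for post in posts:
        opinion = parse_opinion(post)
        if opinion.get("about") == about and opinion.get("opinion-type") == opinion_type:
            values.append(int(opinion["body"]))
    return mean(values) if values else None


# Hypothetical newsgroup posts written in the assumed format.
posts = [
    "About: www.ibm.com\nOpinion-Type: rating 1-5\n\n4",
    "About: www.ibm.com\nOpinion-Type: rating 1-5\n\n2",
    "About: www.ibm.com\nOpinion-Type: comment\n\nNice site.",
    "About: www.example.org\nOpinion-Type: rating 1-5\n\n5",
]
print(average_rating(posts, "www.ibm.com"))
```

The comment and the rating about a different site are skipped, so only the two matching ratings contribute to the average.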
Aggregation and Filtering
Aggregation and Syndication
Going back to the scaling problems of metadata, you'll notice that most of the above communication techniques involve some amount of centralization. It is possible to do multi-party chat in a more decentralized way than IRC (MSN IM, AIM, etc.), and it is possible to share more permanent messages and documents collaboratively in a more decentralized way than NNTP (Groove), but these techniques aren't designed to scale to large numbers of users. This is a basic law of scalability: the more individual nodes that allow updates, the more expensive it will be to merge those updates together later. Systems like NNTP decentralize in the direction of the user as much as possible, to allow faster retrieval. This is similar to the way that Akamai edge caching moves the data closer to the users who demand it the most. At the same time, however, the number of NNTP servers per user remains low (like 0.0001), and updates are batched to "flow" throughout the network of servers. You could put an NNTP server on everybody's personal machine, but then everyone would have to wait a year or two before seeing someone else's post.
In general, we call it syndication whenever one source of information publishes that information to other sources, and we call it aggregation when one site pulls information from multiple sources. Syndication could be seen as a decentralizing flow, while aggregation would be a centralizing flow. But typically syndication sources publish to other aggregators, and the users are eventually required to go to an aggregator and retrieve the information, rather than have it syndicated directly to their desktops. This is the case with NNTP, where one NNTP server may serve as a source for many other servers, but users generally poll the server for new articles rather than having an entire newsfeed delivered to their desktop. The distinction is sort of fuzzy, but it's mostly the same issue described under "Push, Retention, and Latency" earlier in this document.
Often sources that syndicate data will syndicate an entire article. On the other hand, aggregators will sometimes aggregate metadata (like headlines and descriptions) and simply point to the actual location of the article. You can think of NNTP this way -- when you post an article to the newsgroups, the entire article is copied to thousands of aggregators world-wide. If you take this analogy a step further, when you use a newsreader to connect to a news server, you usually only collect the titles of the articles and pointers to some sort of ID that you can use to grab the entire article from the server if you want. In other words, NNTP servers pass mostly data amongst themselves, and newsreaders extract metadata from the news servers to start. (Now, if the distinctions aren't fuzzy enough for you yet, remember that we are talking about sharing opinions, and we are pretending that all newsgroup postings are really opinions about some web page or another. So the articles themselves are metadata, and the list of topics being downloaded by the newsreader is technically metadata about metadata. This should make you realize that the distinction between metadata and data is fairly arbitrary, and that there is nothing wrong with taking meta to multiple levels -- in fact, in this example it is a good architectural decision. If that makes sense, then you are a very smart person or I am a good writer, or both.)
Aggregation can also be used in the sense that databases aggregate data by summarizing, averaging, or other similar functions across groupings of the data. For example, an NNTP server could collect all of the opinions posted to a newsgroup that give IBM a rating between one and five, and then combine all of those messages into a single message with just the average rating before sending the information to another NNTP server.
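Here is a small Python sketch of that kind of summarizing. One caveat it tries to illustrate: if a server forwards only the bare average, the next server cannot combine it correctly with another server's average, so the sketch assumes each summary also carries the number of ratings behind it. The `RatingSummary` type and the `summarize` and `merge` helpers are hypothetical names, not part of NNTP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatingSummary:
    """A compact stand-in for many individual rating messages."""
    about: str
    count: int  # how many ratings were folded into this summary
    total: int  # sum of those ratings

    @property
    def average(self):
        return self.total / self.count


def summarize(about, ratings):
    """Collapse a non-empty list of 1-5 ratings into one summary."""
    if not ratings:
        raise ValueError("cannot summarize an empty list of ratings")
    return RatingSummary(about=about, count=len(ratings), total=sum(ratings))


def merge(a, b):
    """Combine two summaries about the same subject without losing weight."""
    if a.about != b.about:
        raise ValueError("summaries describe different subjects")
    return RatingSummary(a.about, a.count + b.count, a.total + b.total)


server_one = summarize("www.ibm.com", [4, 4, 4, 4])
server_two = summarize("www.ibm.com", [1])
combined = merge(server_one, server_two)

# Averaging the two averages would weight one user as heavily as four.
naive_average = (server_one.average + server_two.average) / 2
print(combined.average, naive_average)
```

Carrying the count costs one extra number per summary and keeps the combined figure equal to what you would get from the original messages.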
Channels and Subscriptions
The other important thing about all of the communication techniques listed earlier is that the user is never forced to cope with all of the data going over the network. For example, on IRC you only join the channels that you are interested in. You only go to web-based message boards that interest you, you only read newsgroups that you find interesting, and so on. This process of selectively subscribing (or joining a channel) to get at interesting opinions is basically a method of filtering. In fact, the filtering can take place between servers on the network. For example, a company running a corporate NNTP server may choose to filter out all of the alt.* newsgroups, but keep the rest.
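The corporate filter in that last example might look something like the following Python sketch. The list of incoming newsgroups and the `is_carried` helper are invented for illustration; real news servers have their own configuration syntax for this.

```python
from fnmatch import fnmatchcase

# Hypothetical server policy: drop anything in the alt.* hierarchy.
BLOCKED_PATTERNS = ["alt.*"]


def is_carried(newsgroup, blocked=BLOCKED_PATTERNS):
    """Return True if the server should keep carrying this newsgroup."""
    return not any(fnmatchcase(newsgroup, pattern) for pattern in blocked)


incoming = [
    "comp.lang.python",
    "alt.binaries.misc",
    "rec.arts.books",
    "alt.folklore.urban",
]
carried = [group for group in incoming if is_carried(group)]
print(carried)
```

A user's own subscription list is the same idea applied one step further down: another pattern filter, chosen by the reader instead of the server operator.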
Decentralize Data, Centralize Metadata, Decentralize Meta-Metadata
OK, so the title of this topic is a bit stupid. I wanted to keep continuity with the "Decentralize Data, Centralize Metadata" theme earlier, but convey the idea that what will really be happening is a series of flows involving centralization, aggregation, and subsequent decentralization. In the examples we have given above, the process of deciding how decentralized the data gets (to push it closer to users for fast retrieval) and deciding how to syndicate and aggregate the data is quite manual. That is, humans at various points decide to establish an NNTP or IRC server, decide which other servers to connect with, establish filtering rules, retention policies, and so on. The current trend toward edge-caching of content through services like Inktomi or Akamai is actually very similar to the aggregation/filtering cycle, though. In a sense, the edge caches act as aggregators of web content, and they filter the content they aggregate based on popularity (which is a bit of semantic metadata that they have to determine dynamically). Edge caches available today are not nearly as sophisticated in their range of options as are the manual syndication and aggregation systems, but they are able to operate largely without human configuration. As edge-cache intelligence continues to increase, and as aggregation and syndication systems add more intelligence, it is possible to imagine these two techniques converging.
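To show what "filtering on popularity" could mean in its simplest form, here is a toy Python sketch. It is not a description of how any commercial edge cache works; the `PopularityCache` class, its capacity setting, and the URLs are assumptions made up for the example. The cache counts requests (the metadata it has to determine dynamically) and keeps only the most requested pages.

```python
from collections import Counter


class PopularityCache:
    """A toy edge cache that decides what to keep from observed request counts."""

    def __init__(self, capacity):
        self.capacity = capacity  # how many pages this cache can hold
        self.hits = Counter()     # url -> number of requests seen so far

    def record_request(self, url):
        self.hits[url] += 1

    def cached_urls(self):
        """The pages worth holding right now: the most requested ones."""
        return [url for url, _ in self.hits.most_common(self.capacity)]


cache = PopularityCache(capacity=2)
for url in ["/home", "/news", "/home", "/about", "/home", "/news"]:
    cache.record_request(url)
print(cache.cached_urls())
```

Nobody configured which pages to keep; the filtering rule falls out of the traffic itself, which is the contrast with the hand-built NNTP and IRC arrangements described above.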
Architecture Summary
The semantic web wants to be decentralized and open. This is not to say that centralized uses of metadata tightly-coupled to tools are a bad idea -- in fact, there are plenty of good reasons why such uses might evolve and serve user needs. However, those applications are not the topic of this document. This document is about "the semantic web", and the semantic web is by definition inclusive and interoperable. Creative application of aggregation, syndication, and filtering techniques will probably be necessary to scale the semantic web appropriately without sacrificing inclusiveness.