Making a Semantic Web
February 11, 2001
Joshua Allen
joshuaa@microsoft.com
Introduction
If you've paid any attention to the web standards discussions, you may have heard the phrase "Semantic Web", or perhaps even been pressured to use standards with names like "Dublin Core Metadata" or "RDF". If you've attempted to read any of the available documentation on these topics, you were probably intimidated by terms such as "reify" and all sorts of artificial intelligence concepts. This document attempts to explain what all of this chatter really means, and help you decide which parts you should care about and why. I have tried to use common-sense, real-world examples and stay away from complicated terminology. Please contact me if you find significant omissions that you would like me to correct, or if you find certain portions of the explanation unnecessarily complex.
Disclaimer: I am not responsible for evaluating any business strategies related to the subjects covered here. I was not asked by my employer to create this document, and nobody at my company endorsed or reviewed it. This is simply my own personal perspective after recently researching this topic, and I reserve the right to completely change my mind about the opinions expressed here at any time. It is nearly Valentine's Day, and nowhere in this document do I use the word decommoditize. This paper is copyright Joshua Allen. I wrote this, and if you try to change it or claim credit, you will be taunted mercilessly.
Reinvigorating the Web
The web was pretty revolutionary, right? Before the web, we had systems like HyperCard which let us create documents and hyperlink those documents together. But the web was world-wide. Anyone with a server could publish documents for the rest of the world to see, and you could hyperlink any document to any other document. It didn't matter if the page you were browsing was being served up by some guy in Kuwait from a Unix server, and your web site was running on a Macintosh in Boston; if you could browse the page, you could link to it. These early days were exciting indeed, but we've been excited about what is basically the same old web for the past ten years. Hyperlinking to everything in the universe is cool, but it's become rather boring. Now that we have all of these documents linked together every which way, isn't there something more we can do with them?
Some Examples
Here are some examples of things that could be done to make the web a better place.
Context-aware Links
Suppose that you are browsing a web page that uses the word "reify", and you want to know what that means. You also happen to know that you can look up any word by using a URL like http://www.dictionary.com/cgi-bin/dict.pl?term=someword (Replace someword with the word you want to look up). If the author of the page had added a hyperlink like this to the word "reify", you would be able to click on it to look it up. But why depend on the author? Why doesn't your web browser let you highlight a word, and then allow you to select a command to define the word, which just replaces someword in the URL above with the text you have highlighted? Then you would automatically be taken to the definition of the word "reify". After looking at the definition, you might still be confused by what that word really means, but it would be a neat feature, right? Suppose that the web browser could go a step further, and recognize that the phrase Grateful Dead refers to a music group. It could then link you to RollingStone.com, even if the original author of the page hadn't bothered. In fact, such tools are available today. Though it is not the first or the only example, Comet Systems Smart Cursors is one such rudimentary tool.
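Just to make the mechanics concrete, here is a rough sketch (in Python) of what a "define this word" command might do behind the scenes. The function name is made up for illustration, and the dictionary.com URL is the one from the example above.

    import urllib.parse
    import webbrowser

    DICTIONARY_URL = "http://www.dictionary.com/cgi-bin/dict.pl?term="

    def define_selected_word(selection):
        # Replace "someword" in the lookup URL with whatever text is highlighted.
        word = urllib.parse.quote(selection.strip())
        webbrowser.open(DICTIONARY_URL + word)

    define_selected_word("reify")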
Site Information
Now consider the case of browsing a page like www.news.com. What if the browser could detect that you were at a news.com site and provide you with a menu option, "About the owner of this site", that pulled up company information about CNET (the owner of news.com)? This is similar to the above, but in this case the browser is being smart about the whole site or domain instead of words within a page.
Collaborative Filtering
Next, pretend that your browser has two buttons in the toolbar, a "thumbs up" and a "thumbs down". Now any time that you are browsing the web, if you find a page to be particularly terrible, you hit the "thumbs down" button. On the other hand, if you find a page to be so inspiring that you can't restrain yourself, just hit the "thumbs up" button. Each time you hit the button, your disposition is recorded somewhere. Now when you browse to any page, your browser can also give you helpful hints in the status bar, like "80% of your friends gave this page a thumbs-up", or "90% of your friends say that this page stinks". In fact, search engines like Google could use this information to rank search results. Note that you could also use rankings (e.g. one to five) and calculate average ratings, etc.
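Here is a toy sketch of the idea, assuming the votes are simply (friend, URL, verdict) records stored somewhere your browser can get at them; the data and names are invented for illustration.

    # A toy model of the "thumbs up / thumbs down" feature: record votes and
    # turn them into the status-bar hint described above.
    votes = [
        ("alice", "http://www.espn.com", "up"),
        ("bob",   "http://www.espn.com", "up"),
        ("carol", "http://www.espn.com", "down"),
    ]

    def friend_verdict(url, votes):
        verdicts = [v for (_, u, v) in votes if u == url]
        if not verdicts:
            return "no opinions yet"
        ups = verdicts.count("up")
        return "%d%% of your friends gave this page a thumbs-up" % (100 * ups // len(verdicts))

    print(friend_verdict("http://www.espn.com", votes))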
Collaborative Categorization
Imagine you are browsing the web and you see a page that gives an excellent description of dolphins. You want to remember it for possible later use, so you go to the "My Encyclopedia" menu of your web browser and select "add page". You're presented with a dialog box asking you to select categories for this encyclopedia article, and you enter "Dolphins;Sea Mammals". Your metadata is stored on a central server somewhere. Now any time that you or anyone else goes to the "My Encyclopedia->Browse Articles" menu and browses to the "Sea Mammals" or "Dolphins" categories, this article will appear. This technique could be used for directories or any other categorization scheme as well.
Annotations
Ok, so what if it's not enough to share ratings with people? Imagine that you are browsing the home page of a fellow Asheron's Call gamer, where he claims to be the best player on the 'net. You know that is totally bogus, so just use your browser to add a comment to the page (actually, you probably don't have permissions to edit that guy's page, so your annotation would be sent to an annotation server somewhere). Now when your friends browse to the same page, they'll be able to see your rebuttal as well (because their web browsers will automatically contact the annotation server to see if there are any comments), explaining why you are a much better player.
Related Links
Another sort of annotation is the related link. Imagine if your web browser allowed you to recommend related web pages for every page that you view. Any time that you view a web page, your browser could show you the top five related pages recommended by other people. Obviously there are many variations on this concept.
Corrections
Finally, one potentially controversial tool would be something that allowed you to change the content of other people's web pages. Your changes would probably not be applied directly to the page, but saved in a separate and independent annotation server. Then, when anyone browsed to that page, the browser would download the changes and apply them on-screen. Clearly, you would want to limit who was allowed to modify pages that you viewed, and perhaps be able to distinguish modifications from the original text.
Is that All?
The examples above are certainly not a comprehensive list of all the cool things that can be done with the web. There are many innovative changes taking place on the web right now that do not necessarily fit the common thread of the examples above. To be clear, we aren't talking about everything that could be done with the web. In this document, we're talking about tools and metadata.
It's About Metadata and Tools
One thing that all of the previous examples have in common is that they each involve the use of some extra data about the page that you are viewing. Metadata is the word used to mean "data about data". For example, if you record a "thumbs up" rating for the site www.espn.com, your rating of the site is metadata about www.espn.com. In these examples, the purpose of the extra data (or metadata) is to add some meaningful information to the web pages so that tools can do something with it. To the tools that use this metadata, the information it holds is meaningful. You see, you could just as easily post your comments about various websites to all sorts of discussion boards and newsgroups, but then it would be pretty difficult for your web browser to figure out what people think about a certain web page. So, when we talk about extra data that is meaningful to tools, we sometimes describe that information as being semantic (semantic just means "having to do with meaning").
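One simple way to picture this "data about data" is as a pile of statements, each saying something about some URL. The Python sketch below is only an illustration of the idea; the property names are invented, and the real formats are discussed toward the end of this document.

    # Each piece of metadata is modeled as a (subject, property, value)
    # statement about a URL.
    metadata = [
        ("http://www.espn.com", "rating",   "thumbs-up"),
        ("http://www.espn.com", "category", "Sports"),
    ]

    # Because the statements follow a predictable shape, a tool can ask a
    # meaningful question like "what do we know about espn.com?"
    about_espn = [(prop, value) for (subject, prop, value) in metadata
                  if subject == "http://www.espn.com"]
    print(about_espn)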
How Does Metadata Get There?
So now that we have defined "semantic metadata" as being extra information about a page that allows tools to do interesting things, let's discuss the different ways that this meaningful information gets created. Also presented are some possible motives which might encourage someone to create metadata.
On The Fly Text Parsing
Some tools will attempt to discover meaningful information by scanning through the text of the document as you are viewing it. For example, a tool could use a lookup list of all popular music bands and look for those band names within the text. It is also possible to create tools that understand the grammatical structure of the sentences in the text and can guess meaning based on that structure. Microsoft's MindNet and Princeton's WordNet are two databases that might be used to understand the structure of sentences. The MindNet and WordNet lists of words and relationships are meant to cover the entire spectrum of commonly-used words. In some cases a tool may use a specialized list of word meanings and grammar specific to a certain domain, such as healthcare.
Tools that extract semantic information on the fly might be provided free with software as a value-add, used to direct traffic to partner sites, or sold directly as software.
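To make the band-name example concrete, here is a minimal sketch of on-the-fly parsing with a lookup list. The band list and the RollingStone.com search URL are invented for illustration.

    # Scan the visible text against a lookup list of band names and propose a
    # link for each match. (KNOWN_BANDS and the search URL are made up.)
    KNOWN_BANDS = {"Grateful Dead", "Nirvana"}

    def find_band_links(page_text):
        links = {}
        for band in KNOWN_BANDS:
            if band in page_text:
                links[band] = "http://www.rollingstone.com/search?q=" + band.replace(" ", "+")
        return links

    print(find_band_links("I saw the Grateful Dead play in 1977."))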
Embedded in Page
Content publishers can embed metadata directly in a page. One common example would be information used by filtering software. Content providers could mark certain material as being for mature audiences only, and then various filtering tools offered to protect children would know to block those pages. Publishers would have an incentive to embed such metadata to minimize risk of legal trouble. Publishers might also embed metadata simply to make their pages more useful and thus add value to the site. Obviously the publisher must anticipate that there exist tools which can do something useful with the particular metadata being added, or else the extra work of adding the metadata would be pointless.
User Published Files
A user might create metadata files describing something on the web and place those files on a web server that she owns. This is sort of like embedding the metadata in the page, but since she doesn't have access to the site with the referenced pages, she just puts the metadata on her own site and links it. Obviously, tools would not be able to do much with this metadata unless the tools knew where the metadata was or were configured to contact a web crawler/search engine which knew how to parse the metadata files.
Service Provided by Site
Some sites will offer users the ability to add their own metadata directly through an interface provided by the site. For example, most articles on the Microsoft Developer Network (MSDN) allow people to add comments to the bottom of the article. CNET's news.com also allows people to add comments to news articles. Both sites use different web-based interfaces for adding comments, and both sites store all of the user-entered metadata in systems controlled by the site.
Batch Text Parsing (Crawler)
Search engines such as Google crawl around the web extracting information about the pages that are found. In fact, Google is able to determine quite a bit of meaningful data about a page beyond just the HTML data. Google is able to determine roughly how many other pages link to a particular page, and thus rank it appropriately in search listings. Google knows how many times specific words occur, and knows which pages are similar to a given page. Crawlers can also parse text as described in the "On The Fly Text Parsing" section above, or read metadata embedded in a page being crawled. Metadata crawlers might be operated to add value to other services, sold on some sort of per-use or subscription basis, used to drive traffic to partner sites, used to discover information that can be sold to marketers, and so on.
Specialized Metadata Server
Sometimes tools will use their own servers to store and query metadata about documents. For example, Third Voice is a tool that allows users to store annotations and related links about web pages that they visit. The information that users enter into the tool is stored in servers run by the company which provides the tool. The metadata server is useful because it allows all of the users of the service to collaborate and share annotations. Generally, people who enter metadata into such a system are doing so because they expect some benefit from collaboration. Such a system could offer cash rewards for people who write frequently-used annotations, or people might be motivated to create content simply to "share with the community". Others might use the metadata servers as a way to broadcast information that they wish to be seen by the public.
Generic Metadata Server
Finally, it is conceivable that a generic metadata server could be created which can act as a storage and retrieval engine for all sorts of tools that require shared and collaborative metadata. For example, a system that allowed users to mark pages as not being suitable for children could use the server to store URLs of "unsuitable" pages, and another tool might use the same server to store information about whether particular users said that a web page was a "thumbs up" or "thumbs down". Perhaps later tools could combine the metadata into even more compelling services. For example, your Google results could return only pages that were suitable for children and got "thumbs up" more than 80% of the time.
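A sketch of that last combination might look something like this, assuming the generic server can hand a tool both the "suitable for children" list and the thumbs-up percentages. All of the URLs and numbers are invented.

    # Keep only search results that are marked kid-safe and that at least 80%
    # of raters gave a thumbs-up.
    kid_safe = {"http://example.org/dolphins"}
    thumbs_up_pct = {"http://example.org/dolphins": 92,
                     "http://example.org/gossip":   40}

    def filter_results(results):
        return [url for url in results
                if url in kid_safe and thumbs_up_pct.get(url, 0) >= 80]

    print(filter_results(["http://example.org/dolphins", "http://example.org/gossip"]))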
How Do Tools Query the Metadata?
"All of this is fine", you say, "but letting people or computers add extra meaningful data to web pages is only half of the picture. What's next?" Maybe the more important half is how those tools get at all of this meaningful data to do something with it. If we assume that we can read the metadata (with web crawlers, etc.), we still need to tell our tools how to request metadata about a certain page. (If the metadata has been parsed on the fly while the user is loading the page, or if the metadata is embedded in the page, then we can assume the tool already knows the metadata for that page. When we talk about a tool querying the metadata, we are talking about the other cases above, when the metadata is stored separate from the web page).
Individual Sources (Go Get It Every Time)
It is possible that your tool could download a file (or files) with the metadata directly from each of the places that the metadata is stored. For this to be very smart, your tool would have to know where the metadata was stored, and there would have to be very few locations. Neither of these criteria is likely to be met in many situations.
Personal Database
Think of this as "your own personal Google". You have some sort of query engine that collects metadata from various sources (any sources listed in the previous section on "how metadata gets there"), indexes it for fast lookup and allows you to query that information. A benefit of this technique is that you could limit the database to store only information that you are interested in.
Shared Database
Tools would use some common API to ask the shared server for metadata.
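No such common API has been agreed upon, so the following Python sketch is only a guess at its shape: tools hand the shared server statements about URLs, and later ask for everything the server knows about a given URL.

    class SharedMetadataServer:
        def __init__(self):
            self.statements = []            # (subject_url, property, value) records

        def add(self, subject, prop, value):
            self.statements.append((subject, prop, value))

        def query(self, subject, wanted=None):
            # Return all metadata about a URL, optionally limited to one property.
            return [(p, v) for (s, p, v) in self.statements
                    if s == subject and (wanted is None or p == wanted)]

    server = SharedMetadataServer()
    server.add("http://www.news.com", "owner", "CNET")
    print(server.query("http://www.news.com", "owner"))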
The Semantic Web
Hopefully by now you can see why intelligent tools using metadata are going to improve the web. The above list of examples and motives for use of metadata is by no means complete or comprehensive; the possibilities are pretty much unlimited. Now that we've given some examples of how metadata can be used, though, let's try to narrow things down and determine which uses of metadata are being talked about when people talk about "the semantic web". It is true that even the concept of "the worldwide web" is rather amorphous, so it is practically impossible to draw the line and say "this is where the web ends and the non-web begins". The line I draw here is necessarily arbitrary and subjective, but it is a good starting point for you to think about these things on your own.
Globally Inclusive
The web didn't take the world by storm by being the first hypertext system. There were plenty of hypertext systems around before. The reason that the web became ubiquitous is that it allowed anyone to link to any other document and let everyone see their pages, so long as they had a simple Internet connection and a place to put a web page. You don't have to get permission from someone to hyperlink to their page, and you don't have to get permission from people who link to your pages to change or remove the pages. In the same spirit, there are many systems today that allow metadata to be stored and indexed -- these systems could be compared to the HyperCards of the pre-web days. Until anyone can create metadata about any page and share that metadata with everyone, there will not truly be a "semantic web".
Collaborative
Metadata can be useful for many things. But unless metadata is meant to be shared with others, it isn't about "the semantic web".
Interoperable Metadata
Imagine what life would be like if Netscape browsers used a completely different web page syntax than Internet Explorer uses. Of course, both browsers differentiate on certain features in order to compete, but it is in both browsers' best interests to remain interoperable and just ignore tags that have no meaning to them. If interoperability is so important in a field which is largely dominated by two browsers, how important will interoperable metadata be in a field which is likely to have scores of popular tools? It is true that indexing engines such as Google, which create metadata by scanning millions of documents, will have little incentive to "share" their metadata with others. Their business model is to provide an end-user service based on the metadata that they create. But remember that I defined "the semantic web" to be about that semantic data which is created deliberately by users in the same way that people deliberately publish web pages. In the potential market for tools that allow users to benefit from shared metadata, there will be a temptation to create customized metadata formats for each tool in order to lock in customers or meet specific short-term business needs. Tools may certainly create metadata that other tools do not find meaningful, but it will often be in a tool creator's best interests to make that metadata interoperable with other tools, as meaningless as current tools may find it. Some motives for interoperable metadata are described below.
Tools Choice
Consider annotations, for example. Even if two different annotation tools keep track of slightly different metadata (perhaps one records the date of the annotation and another does not), it could be wise for the metadata formats to be as interoperable as possible, because then each tool would have the future option of enticing users away from the other tool. Allowing metadata to be shared across tools encourages competition among tools and allows users to choose the tool (annotation viewer, for example) which has the features they want. It also provides a stable platform to create tools that combine features of other tools.
Inference
Suppose that you have one tool which keeps track of company names and mailing addresses, while you have a completely different tool which keeps track of all of your friends' email addresses and company names. A tool that was smart enough could look at a friend's company name and match that with the information about company mailing addresses and infer a mailing address for your friend. Widespread inferencing is something that is probably not an immediate opportunity for tool creators, but as long as the metadata created by tools is interoperable, it should be easier for future tools to use that data to deduce new metadata that was never recorded. In the example above, nowhere was it ever recorded what your friend's mailing address at work was -- this information was attainable because the two different sets of metadata were interoperable and because the tool reading the data knew how to infer that information.
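Here is the same example as a sketch: neither data set ever records your friend's work mailing address, but joining them on the company name produces it. The names and addresses are made up.

    company_addresses = {"Initech": "123 Main St, Austin, TX"}
    friends = {"maria@example.com": {"company": "Initech"}}

    def infer_work_address(email):
        # Follow friend -> company -> mailing address, even though the
        # combined fact was never stored anywhere.
        company = friends[email]["company"]
        return company_addresses.get(company)

    print(infer_work_address("maria@example.com"))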
Network Effect
Think of the bazaars that were once common in cosmopolitan cities. Why is it better to go to the bazaar to barter your goods than to stand on your street corner trying to barter? The answer is common sense; it is better to go to the bazaar because that is where everyone else goes. One would imagine that humans have instinctively understood this for thousands of years, but today we attach some pseudoscientific mathematical formula and call it "The Network Effect". Tool vendors can apply this common sense by realizing that their tools will be even more compelling to users if all of the users of different tools have all of their metadata collected into a shared space. Just like people from different cities and different cultures meet at the bazaar to barter, different tools can choose to save metadata into a shared city of interoperability. If users of tools such as Third Voice can view annotations made by users of other tools, and users of other tools can view annotations made by Third Voice users, users will see much more value in using annotation tools. Note that this so-called network effect is applicable to certain situations, particularly in the collaborative areas to which I limit my discussions of "the semantic web". For example, it would be blindly stupid for Google to export all of their metadata in a format that could be used by MSN search for free. Not only would Google not benefit from such a move, but users would not benefit either. Although it makes sense to barter goods in a place where everyone else is bartering, there are no "network effect" benefits likely to be gained if everyone were to collect in one spot to read a book or watch television, for example.
Architecture
OK, now that we've established what we are talking about (in this document) when we say "semantic web", let's hurry along and cover some of the architectural issues that are involved with actually implementing the semantic web.
Centralization?
First, and perhaps most important, of the architectural challenges is that of centralization. Nobody wants to store their data on centralized servers, because then their data would be vulnerable to Byzantine threats (like governments and floods). This is especially true of metadata -- if you put up a web page criticizing a political candidate, it's unlikely that many people will see your page without a serious effort on your part. On the other hand, if you add an annotation directly to the candidate's home page, anyone using an annotation viewer will see it when they visit that candidate's home page. Some candidates could be convinced that blocking such comments would be beneficial to their campaigns. Such concerns mean that metadata which you intend to share with others will probably want to be on servers that are not too centralized, and probably not on the servers hosting the data you are tagging (Of course, if you own the web pages, you could embed the metadata in the page). Metadata storage techniques that are vulnerable to Byzantine threats may not inspire enough user trust to reach a critical mass.
Besides these sorts of threats, there is another problem with metadata being centrally stored or indexed. Tools depend on the ability to query metadata (for example, a tool might ask for the top five recommended links associated with a certain page). Having a standard way in which tools can query metadata servers allows tools to be independent of servers, and encourages tool development and service development. However, if there are only a few central indexing services, there will be a strong incentive for these services to compete by differentiating their services, adding features, and "improving" the query interfaces. It is likely that reducing the number of metadata services would lead to tools that are tightly coupled to one service or another.
Scaling Metadata
Scalability factors are where the differences between metadata and data become apparent. First, the location where web pages are published has little to do with your ability to experience those web pages. As you follow hyperlinks throughout a site and on to new sites, your browser contacts each location as appropriate and gets the file that you want to look at next. You browse one page at a time, more or less. On the other hand, metadata is not something that you browse "one piece of metadata at a time". The truly interesting metadata is useful only in aggregate. Consider the case of a rating system. Suppose that one hundred different users have all rated the www.ibm.com page with a number between one and five and have stored their ratings in a standard file format on their own personal web sites. Would you want your web browser to contact every user's site and combine all of the ratings each time you browse to www.ibm.com? Of course not! It is clear that (unlike web pages) metadata is not interesting when grouped by publisher, but is interesting only when grouped based upon the thing which the metadata describes. But the most natural way to group things based on the thing being described (the web pages) is to just store the metadata with the web page, and we've already decided that is a bad idea. Web page owners have too much incentive to remove metadata that they do not agree with.
Decentralize Data, Centralize Metadata
In fact, metadata is intended to be indexed, searched, and processed by tools, while web pages are designed for humans to view one page at a time. The best way to index something to allow fast searching and processing is to centralize it. If you consider Napster, you can see that the metadata is centralized for fast querying, and the data is decentralized for fast retrieval. This design rule, "decentralized data, centralized metadata", can be seen on the web today. At the same time as HTML pages are being decentralized further and further from the core through services like akamai and inktomi edge caching, searching functions on the Internet are being centralized into ever more powerful engines like Google. Note that it is possible to have actual metadata stored in a decentralized way, but the indexing and querying of that metadata will be most efficient if centralized.
What it Means
Users will demand systems that perform well, and as we push the limits of what we do with metadata (multiple levels of inference, for example), the performance strain of querying metadata will become increasingly greater. Because the performance requirements will increase as user expectations of features increase, people will tend to be drawn more to metadata services that store metadata centrally. At the same time, the net's libertarian ethic will mistrust anything that is too centralized. The social desire for decentralization will probably be no match for the consumer demand for features and performance, however, so it is likely that most collaborative metadata functions will end up collecting in a few centralized services. Developers of shared metadata services will need to design with the expectation that the inexorable forces of nature will tend toward centralized metadata, and build in safeguards to make sure that their systems cannot become too centralized (and thus become exposed to Byzantine threats, jeopardize the critical mass of users willing to use the service, etc.) This fundamental tendency of shared metadata to flow to the center is also an opportunity for the software politicians. As market forces drive metadata into centralized repositories, these centralized services will become wonderful targets for conspiracy theories and other such sensationalized portrayals that help the software politicians maintain flock size.
Metadata Explosion
The final scalability challenge threatening to cloud metadata's bright future is the probability that the amount of metadata being generated will be far greater than the amount of data being tagged, and the number of queries made against metadata could exceed the number of queries made to request the related web pages. You can easily imagine the combined size of annotations and comments about the www.microsoft.com home page being much larger than the size of the web page itself. That alone might not be a problem, but remember -- we are talking about centralizing this stuff at least a little. You might take hope from the fact that Google actually keeps complete local copies of every page that it indexes, so in a sense Google has already copied the entire web onto its storage arrays. But there are two things that suggest this is not the most instructive example. First, storing a page is not the same as storing dynamically changing metadata about that page (which may exceed the size of the page) without having that metadata go stale. Second, buying enough permanent storage to cache the entire web is the sort of expense that would guarantee that only a few centralized players would be in the market, and this overcentralization would tend to defeat the idea of "the semantic web".
It's About Communicating Your Opinions
It seems like it will be pretty challenging to scale this "semantic web" thing. But before we give up, let's step back and think about an analogy that holds especially true for metadata. Consider the possibility that all of this metadata stuff is just a fancy way of saying "communicating your opinions about web pages". Think of the examples given in the "Some Examples" section. Even when you categorize an article as being about "Sea Mammals", that can be thought of as your (wise) opinion. Other people might categorize the same article in the "Medicine" category because they believe that dolphins possess healing properties that can be harnessed by grinding up dried dolphin fins (it's a stretch, but you get the point, I hope). The example earlier of a tool which queries for company information based on the owner of the site you are viewing (in the example, information on CNET when browsing news.com) is equally "opinion". If the tool uses a fairly respected source to determine that news.com is owned by CNET, then you can consider that opinion to be quite authoritative, and not really an opinion at all; but it's more useful to think of metadata as a bunch of opinions about things, some of which may be more authoritative and others more speculative. That's the way everything in the world is, anyway. Now, you may be thinking, "What about the tool that scans the web page for bands to look them up at RollingStone.com? It is using a grammar parser and natural language processing, so how could that be an opinion?" To that question I reply: prepare to be shocked the first time you are reading a web page about the Virgin Mary and your web browser redirects you to the web page of the singer "Madonna".
Anyway, I promised that you would get some insights about scaling shared metadata by thinking of shared metadata as being shared opinions. So let us start by looking at some ways that people share opinions.
Message Boards (HTTP)
Places like epinions.com, yahoo groups, and MSN communities allow people to post their opinions about all sorts of things, from products to politicians. Once you post your opinion, it usually shows up right away. Other people need to go to the site and specifically request your opinion to see it, though. If you have your own web server, you can always write your opinions into an HTML file and save it to the server as well.
Mailing Lists (SMTP)
Some message boards allow you optionally to have the messages sent to you as e-mail. Mailing lists allow you to e-mail your opinion to a group of people who are interested in (or at least tolerant of) your opinions. Some mailing lists also archive e-mails to the web or allow you to post to the web, but there is an important difference between the web-based message boards and e-mail lists. With an e-mail list, you do not get to see the other person's opinion the instant they send it, but you don't have to go poll a central web site to get the opinion, either. Using e-mail pushes the information closer to you and adds a little bit of latency.
Newsgroups (NNTP)
When you post your opinion to a usenet newsgroup, people on the same server as you will see the message pretty quickly, but someone on the other side of the globe may have to wait much longer as the NNTP servers pass the opinion along. You get higher latency, but you don't have to have copies of all the opinions in your e-mail inbox. The other important thing about NNTP is that the system is designed to allow multiple distributed write masters to create data, and then the data is replicated around the system. A majordomo-style mailing list or an HTTP-based message board both involve central write masters. The multi-master approach of NNTP makes it a good choice for sharing opinions when there are many people participating and huge volumes of opinion traffic.
Chat (IRC)
With chat, everyone sees your opinion pretty much instantly. You need to keep the number of people in a chat room small, and messages terse, though, because things would get overwhelming if you were passing along the quantity of information that goes over a large usenet newsgroup. To get extremely low latency, you give up scalability as well. Large IRC networks will add multiple write servers and synchronize between them, which adds a small amount of latency.
Push, Retention, and Latency
There has been some hype in recent years about "push" technologies. To me, the difference between "push" and "pull" is this: when a system delivers a message to a system that you own, then it is called "push". When a system keeps the message on a machine that it controls, we say that we have to "pull" it down when we want it on our own machine. "Pushing" data to systems that want it (as is the case in the SMTP and IRC examples above) is pretty expensive, so most systems that push data like to delete the data once they are done pushing. For example, how easy is it to go back to an IRC server and see what messages were sent yesterday? On the other hand, web-based message boards and NNTP servers tend to keep copies of the messages around for longer periods of time; that is, they have more liberal retention policies. If we define latency as being "the amount of time between the moment that someone publishes an opinion until the time that the opinion is sitting on a machine that you own", the choice of technique obviously has an impact, although latency doesn't depend strictly on push or retention. You could technically keep hitting the "reload" button on your browser for a web-based discussion board and get posts the instant they arrive. You would probably irritate the board owners, and that is certainly not the most scalable use for a web-based board, but it's possible.
Just in case you were wondering what these communication techniques have to do with metadata, consider what it would be like if everyone who posted their opinions was polite enough to use a standard format for their opinions: telling who or what the opinion was about, what type of opinion it was, and then the opinion itself. As long as people always used this simple format, tools could read through the text of the opinions and do cool things. For example, you could write a tool that reads through the newsgroups for all opinions with the "about" field set to "www.ibm.com" and the "opinion type" set to "rating 1-5", and average the values to see what people think. Making sense?
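For instance, a sketch of such a tool might look like the following, assuming each opinion carries "about" and "type" fields as described. The exact field names and values are invented.

    opinions = [
        {"about": "http://www.ibm.com", "type": "rating 1-5", "value": 4},
        {"about": "http://www.ibm.com", "type": "rating 1-5", "value": 2},
        {"about": "http://www.ibm.com", "type": "comment",    "value": "nice logo"},
    ]

    # Average all of the 1-5 ratings that are about www.ibm.com.
    ratings = [o["value"] for o in opinions
               if o["about"] == "http://www.ibm.com" and o["type"] == "rating 1-5"]
    if ratings:
        print("average rating: %.1f" % (sum(ratings) / len(ratings)))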
Aggregation and Filtering
Aggregation and Syndication
Going back to the scaling problems of metadata, you'll notice that most of the above communication techniques involve some amount of centralization. It is possible to do multi-party chat in a more decentralized way than IRC (MSN IM, AIM, etc.), and it is possible to share more permanent messages and documents collaboratively in a more decentralized way than NNTP (Groove), but these techniques aren't designed to scale to large numbers of users. This is a basic law of scalability; the more individual nodes that allow updates, the more expensive it will be to merge those updates together later. Systems like NNTP decentralize in the direction of the users as much as possible, to allow faster retrieval. This is similar to the way that akamai edge caching moves the data closer to the users who demand it the most. At the same time, however, the number of NNTP servers per user remains low (like 0.0001), and updates are batched to "flow" throughout the network of servers. You could put an NNTP server on everybody's personal machine, but then everyone would have to wait a year or two before seeing someone else's post.

In general, we call it syndication whenever one source of information publishes that information to other sources, and we call it aggregation when one site pulls information from multiple sources. Syndication could be seen as a decentralizing flow, while aggregation would be a centralizing flow. But typically syndication sources publish to other aggregators, and the users are eventually required to go to an aggregator and retrieve the information, rather than have it syndicated directly to their desktops. This is the case with NNTP, where one NNTP server may serve as a source for many other servers, but users generally poll the server for new articles rather than having an entire newsfeed delivered to their desktop. The distinction is sort of fuzzy, but it's mostly the same issue described in "Push, Retention, and Latency" earlier in this document. Often sources that syndicate data will syndicate an entire article. On the other hand, aggregators will sometimes aggregate metadata (like headlines and descriptions) and simply point to the actual location of the article. You can think of NNTP this way -- when you post an article to the newsgroups, the entire article is copied to thousands of aggregators world-wide. If you take this analogy a step further, when you use a newsreader to connect to a news server, you usually only collect the titles of the articles and pointers to some sort of ID that you can use to grab the entire article from the server if you want. In other words, NNTP servers pass mostly data amongst themselves, and newsreaders extract metadata from the news servers to start.

(Now, if the distinctions aren't fuzzy enough for you yet, remember that we are talking about sharing opinions, and we are pretending that all newsgroup postings are really opinions about some web page or another. So the articles themselves are metadata, and the list of topics being downloaded by the newsreader is technically metadata about metadata. This should make you realize that the distinction between metadata and data is fairly arbitrary, and that there is nothing wrong with taking meta to multiple levels -- in fact, in this example it is a good architectural decision. If that makes sense, then you are a very smart person or I am a good writer, or both.)
Aggregation can also be used in the sense that databases aggregate data by summarizing, averaging, or other similar functions across groupings of the data. For example, an NNTP server could collect all of the opinions posted to a newsgroup that give IBM a rating between one and five, and then combine all of those messages into a single message with just the average rating before sending the information to another NNTP server.
Channels and Subscriptions
The other important thing about all of the communication techniques listed earlier is that the user is never forced to cope with all of the data going over the network. For example, on IRC you only join the channels that you are interested in. You only go to web-based message boards that interest you, you only read newsgroups that you find interesting, and so on. This process of selectively subscribing (or joining a channel) to get at interesting opinions is basically a method of filtering. In fact, the filtering can take place between servers on the network. For example, a company running a corporate NNTP server may choose to filter out all of the alt.* newsgroups, but keep the rest.
Decentralize Data, Centralize Metadata, Decentralize Meta-Metadata
Ok, so the title of this topic is a bit stupid. I wanted to keep continuity with the "Decentralize Data, Centralize Metadata" theme earlier, but convey the idea that what will really be happening is a series of flows involving centralization, aggregation, and subsequent decentralization. In the examples we have given above, the process of deciding how decentralized the data gets (to push it closer to users for fast retrieval) and deciding how to syndicate and aggregate the data is quite manual. That is, humans at various points decide to establish an NNTP or IRC server, decide which other servers to connect with, establish filtering rules, retention policies and so on. The current trend toward edge-caching of content through techniques like inktomi or akamai is actually very similar to the aggregation/filtering cycle, though. In a sense, the edge caches act as aggregators of web content, and they filter the content they aggregate based on popularity (which is a bit of semantic metadata that they have to determine dynamically). Edge caches available today are not nearly as sophisticated in their range of options as are the manual syndication and aggregation systems, but they are able to operate largely without human configuration. As edge-cache intelligence continues to increase, and as aggregation and syndication systems add more intelligence, it is possible to imagine these two techniques converging.
Architecture Summary
The semantic web wants to be decentralized and open. This is not to say that centralized uses of metadata tightly-coupled to tools are a bad idea -- in fact, there are plenty of good reasons why such uses might evolve and serve user needs. However, those applications are not the topic of this document. This document is about "the semantic web", and the semantic web is by definition inclusive and interoperable. Creative application of aggregation, syndication, and filtering techniques will probably be necessary to scale the semantic web appropriately without sacrificing inclusiveness.
Advanced Issues
This section talks about things that will need to be figured out as the semantic web evolves, but which are still rather far from being resolved.
Trust
Going back to our discussion of how "authoritative" a piece of metadata is, it is clear that you will often want to filter your metadata based on people you trust. Yes, the semantic web needs to be inclusive, but just because anyone can post their opinion, that doesn't mean you have to trust it or even read it. Here are some ideas that people have discussed (without any discussion of practicality):
E-Mail Grouping
You could filter metadata based on the e-mail address of the submitter. You could possibly use membership in a group to determine trust (for example, you could trust members of the yahoo group "p2p-discuss").
Digital Signatures
Digital signatures would be necessary in cases where you cannot tolerate the risk of your metadata being poisoned by spoofed entries. Note that there is nothing saying that digital signatures are an all-or-nothing idea. You could, for example, give all people with verified digital certificates a trust level of 6, all people with hotmail accounts a trust level of 3, and all others a trust level of 0.
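A sketch of that graded scheme, using the trust levels from the example above; how a certificate actually gets "verified" is left out entirely.

    def trust_level(submitter):
        # Grade the submitter instead of treating signatures as all-or-nothing.
        if submitter.get("verified_certificate"):
            return 6
        if submitter.get("email", "").endswith("@hotmail.com"):
            return 3
        return 0

    print(trust_level({"email": "someone@hotmail.com"}))   # 3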
Rating Raters
Suppose that you browse to a page that clearly sucks, and your browser tells you that 12% of people visiting it thought the page was excellent. You could perhaps tell your browser to assign a trust rating of zero to all of the people who thought it was a good page. You could go even further and post your ratings of these people as metadata, and then all of your friends would know to not trust these bozos, too. Some examples of this technique would be the user ratings on ebay, the "web of trust" on epinions, or the trust hierarchy on advogato.
Affinity
If your main goal is to filter out all metadata except that which is most likely to match your interests and opinions, there are data mining techniques that can cluster data based on patterns of similarity. This is the same technique used by amazon to let you know that "other people who liked this book also liked these other books." The metadata layer could automatically detect clusters of similarity and group users based on those patterns.
Inference
So imagine that you are rating a web page that contains a famous poem. However, someone else may have that same poem copied on a different web page. Wouldn't you like your rating to apply to both copies of that poem? Obviously you need a way to say that the two poems are the same, and as long as one noble user stored the piece of metadata pointing out that both web pages had the same poem, the rest of us could rate whichever page we happened to hit, and the metadata layer could infer that we were also rating the poem on the other page. Inference is really about the way that we express relationships between things, and the way that the metadata system can use those relationships to do other cool things. This one is even harder to solve than "trust".
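Here is a sketch of the poem example: one user records that two pages carry the same poem, and the metadata layer lets a rating of either page count for both. The URLs and the "same poem" link are invented.

    same_poem = [("http://a.example/poem.html", "http://b.example/famous-poem.html")]
    ratings = {"http://a.example/poem.html": [5]}

    def equivalents(url):
        # Collect the pages declared to carry the same poem as this one.
        same = {url}
        for x, y in same_poem:
            if x in same or y in same:
                same.update({x, y})
        return same

    def all_ratings(url):
        collected = []
        for u in equivalents(url):
            collected.extend(ratings.get(u, []))
        return collected

    print(all_ratings("http://b.example/famous-poem.html"))   # [5]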
Beyond Web Pages
In this document, we have only talked about storing metadata about web pages. The web is made up of web pages, so this makes sense, right? But some people like to think of the web as being simply a bunch of URLs, all of which may or may not point to an HTML page. For example, I could use a URI like http://my.imaginary.com/people/JoshuaAllen, and then plaster my rating (10 out of 5) of that URL to any metadata server I can find (the point being that the example URI refers to a person and not a web page). People involved with metadata like to constantly step further and further back to look at the "big" picture, so the idea of using metadata to represent all sorts of entities is very exciting to them. Just imagine if you enter in all of your knowledge about the world as a set of opinions or assertions, and then get some mega-powerful inferencing engine to tell you the secrets of life! The idea that you can develop a metadata framework that supports cool web pages and is robust enough to support general artificial intelligence is certainly a motivating factor for many of the people working on this stuff.
Formats
This section describes some of the formats that can be used to express metadata.
XML
XML is already a common syntax that many tools use, and XML documents do include semantic information. For example, I may have a common "employee" document format that I use, and you may use a different format, but the tag names in the document provide some hints as to the content of the data. If my format uses the tag name "StreetAddress" and yours uses "Address", we can still interoperate. In fact, if there were a central repository (ignore for now that this is a bad idea) that recorded the fact that "StreetAddress" in my XML mapped to "Address" in yours, our import and export tools could happily exchange employee data without requiring intervention on our part. (Although this is similar to "inferencing" discussed earlier, inferencing is normally used to describe those cases where we discover something that wasn't explicitly recorded -- this example is more about "translation", although inferencing is certainly possible with straight XML). XML documents are hierarchical. This means that each element or piece of data can have only one parent. So if your metadata uses the parent-child relationships of the XML structure to represent meaningful data, you will be unable to express meaningfully any sort of multiple-parent relationships. (People try to get around this with id and idref tricks, but it is best to just think of XML data as being a "tree", or in discrete mathematics terminology, "a node-labeled graph").
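To illustrate the "translation" idea (not the hypothetical central repository itself), here is a sketch that converts my <StreetAddress> tag into your <Address> tag using a small mapping table; everything except those two tag names is invented.

    import xml.etree.ElementTree as ET

    TAG_MAP = {"StreetAddress": "Address"}   # my tag name -> your tag name

    mine = ET.fromstring("<employee><StreetAddress>1 Elm St</StreetAddress></employee>")

    yours = ET.Element("employee")
    for child in mine:
        # Copy each element, renaming the tag if the mapping table says to.
        copied = ET.SubElement(yours, TAG_MAP.get(child.tag, child.tag))
        copied.text = child.text

    print(ET.tostring(yours, encoding="unicode"))
    # <employee><Address>1 Elm St</Address></employee>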
OPML
OPML is essentially an XML format for storing outlines, and outlines are also "node-labeled" hierarchical structures, but OPML has an advantage over straight XML in that it restricts the "labels" on the nodes to be stored in a single attribute value. This might not be an advantage for some types of metadata, but it greatly simplifies things when you only want to represent a hierarchical structure. It's about the simplest node-labeled structure you can get.
RSS
Rich Site Summary (the name for RSS 0.91, which is the most widely-used version of RSS currently) is another XML-based, node-labeled hierarchical format, similar to OPML, but it uses tag names to represent various metadata. The metadata that are permitted are laid out in that specification, as well as in the Really Simple Syndication (RSS 0.92) specification. The metadata defined by these specifications are fairly general, but nevertheless define a vocabulary that needs to be used by interoperable implementations. It's a step more restrictive than OPML, but the act of defining some common metadata allows tools to count on the meanings of those metadata and build services. The metadata defined by RSS are meant to be meaningful primarily to syndication and aggregation tools.
RDF (and RSS 1.0)
Resource Description Framework is the W3C recommended way to store generic metadata. RDF is an XML format that allows metadata about pretty much anything to be expressed. RDF is also designed in such a way that metadata can always apply to a URI instead of to the parent item in the XML structure, so it allows one-to-many and many-to-one relationships to be expressed. This allows metadata to be stored in a non-hierarchical manner and allows edge-labeled graphs to be represented. Since RDF attempts to be a generic format for all metadata (and metadata about metadata, ad nauseam), it is extensible by design. In other words, you can store RDF metadata about resources described in a different piece of RDF metadata without worrying about screwing up the format of the original metadata or breaking tools that use it. Parallel to the RSS 0.92 work, RSS 1.0 was developed as RDF Site Summary. RSS 1.0 uses RDF to describe basically the same things as the original syntax, with a few additions. All of the RSS family of specifications are designed for tools that do syndication and aggregation. You can certainly put RDF and OPML/XML to other metadata uses besides just syndication and aggregation.
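Underneath the XML syntax, RDF boils down to statements of the form (subject, property, object), where the subject is identified by a URI. Here is a toy illustration of why that gives you an edge-labeled graph instead of a tree; the URIs and property names are invented.

    statements = [
        ("http://www.news.com", "owner",    "http://www.cnet.com"),
        ("http://www.news.com", "category", "News"),
        ("http://www.cnet.com", "category", "News"),
    ]

    # Build the edge-labeled graph: each edge carries a property name, and the
    # same URI can show up as a subject in one statement and an object in another.
    graph = {}
    for subject, prop, obj in statements:
        graph.setdefault(subject, []).append((prop, obj))
    print(graph["http://www.news.com"])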
Dublin Core
Before the XML craze, a set of common tags for categorizing and describing web pages was created. These tags, called Dublin Core, define some pretty common sorts of metadata that you might want to use, so XML formats like RDF often borrow from Dublin Core.
Credits
Thanks to Dave Beckett, Jonathan Borden, Dan Brickley, Len Bullard, Rael Dornfest, Edd Dumbill, William Loughborough, Uche Ogbuji, Sean B. Palmer, Aaron Swartz, Paul Tchistopolskii, Dave Winer, and someone else that I surely forgot to alphabetize here. The people listed here do not necessarily agree with the content of this paper or even like me, but they have contributed all sorts of very intelligent discussion that has helped me to decide how to present these issues.