01 April 2007

Semantics, Categories and Knowledge

This is the first entry in a multi-part series about what knowledge is on the Web. I will start from the several conflicting claims about what knowledge is and how it works, or should work, on-line, and will continue by explaining what I think knowledge is and why it already works on-line.
Why should I worry about such esoteric themes? My concern is that there are a lot of claims being made these days about what knowledge is and how we can automate it on the Web. These include the many discussions about the Semantic Web and Web 2.0, as well as the endless whine, from some quarters, that the Web is a confused mess and, therefore, urgently needs organising.
My argument, which will take a few instalments to make, is:
  • That what makes the Web a powerful knowledge source is that it is self-organising.
  • That knowledge, and hence understanding and competence, is not a matter of knowing the categories that make up a subject and the rules that relate these categories, but is a process of engaged work and interaction.
  • That what makes something meaningful, and hence knowledge, is a process, not a thing. It is therefore something you do rather than something you get.
  • That knowledge making, and knowledge representation, is necessarily a messy business that requires a diversity of approaches.
  • That translation between logics and knowledges, like translation between languages, is not a straightforward rule-based process, but is partial, interpretive and, at times, impossible.
  • Finally, that despite the fact that the Web is a network of computers which are calculating machines, machines that require categories and rules (algorithms) to work, the Web is, nevertheless, an excellent system for sharing and creating knowledge.

Knowledge as organised words and concepts: The Semantic Web
I want to start by giving my description of The Semantic Web. I know that this seems to be a waste of time as there are a huge number of descriptions and explanations of The Semantic Web, but my description has a slightly different purpose than these many others. My purpose is to examine The Semantic Web as a claim to organise knowledge.
Just to get it out of the way, I should start by stating that I am not opposed to XML or RDF, or even necessarily to ontologies (as used in computing). I regularly use XML and am beginning to use RDF as a means for querying resources. They are both powerful tools and immensely useful. However, as I hope will become clear, they are not straightforward systems for representing knowledge.

The Semantic Web, as defined by the W3C, is not just a series of mark-up languages, but is a multi-tiered model of knowledge (see The Semantic Web Revisited). This vision starts from what it sees as the most basic Web-resource, the URI, and moves up through a series of 'syntaxes' that situate the 'meaning' of the Web-resource in ever more universal categories.
At the base of this pyramid of meaning is what is called a "surface syntax", the XML. XML, as many of you already know, is a convention, as defined by The W3C, for marking-up components of a Web-resource with names (tags). The XML convention does not place any requirements on what these tags should be, just on the syntax of the mark-up. This is what gives XML its wonderful flexibility to represent just about anything in any form and in any language. It is completely local. It allows the author, or authors, to describe the resource using any form of tags, or names, they wish (see W3Schools XML introduction).
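As a minimal sketch of this point, the Python snippet below parses a small, made-up XML fragment with the standard library's xml.etree.ElementTree module. The tag names (blogEntry, titolo, scrittore, luogo) are invented for illustration and are not taken from any real vocabulary. The parser accepts them because it only checks that the mark-up is well formed, not what the tags are called.

```python
import xml.etree.ElementTree as ET

# A hypothetical fragment: the tag names are the author's own invention,
# mixing English and Italian, and no schema is declared anywhere.
DOC = """
<blogEntry>
  <titolo>Semantics, Categories and Knowledge</titolo>
  <scrittore>Robin</scrittore>
  <luogo>Lecce, Italy</luogo>
</blogEntry>
"""


def tag_names(xml_text):
    """Return every tag name in the document, in document order."""
    root = ET.fromstring(xml_text)
    return [element.tag for element in root.iter()]


def is_well_formed(xml_text):
    """True if the text parses as XML, whatever its tags are called."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(tag_names(DOC))
    # Crossed tags break the syntax, which is the only thing XML polices.
    print(is_well_formed("<a><b></a></b>"))
```

The only constraint that bites here is syntactic: arbitrary names pass, while improperly nested tags do not.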
Above the XML layer are the XML Schema and Query languages. These are equally useful as they allow the author(s) to describe what they see as restrictions on the structure and content of the XML document. Basically, the XML Schema file is somewhat like a data-definition for a database, though it does a bit more (see W3Schools XML Schema). XML Query, or XQuery, is a developing language for writing queries to XML files, preferably with XSD (XML Schema) definitions (see W3C XQuery overview). Both of these conventions extend the open mark-up of XML so that XML files can be queried as data sources. These conventions are also very flexible and extensible, allowing authors to define their data as they wish, and to extend these definitions to other XML resources.
Up to this point, it all seems quite sensible. We have a much more flexible and extensible mark-up language, which can define information almost any way that the author wants. We have a Schema and a Query language that allow for open and extensible definitions of how the author thinks people, or machines, should access and search this resource. Because of the structure of these three documents, we can even see other people writing XML Schemas and XQueries for any XML file, whether they authored it or not. This makes for a very open-ended Web.
This may accommodate the need for flexibility locally, at the file-face, so to speak. However, we are told that there is "no structure", "no agreed terminology", "no order", and hence "no knowledge" at this level. Something must be done, and W3C has done it.
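To make the idea of an XML file as a data source a little more concrete, here is a small sketch. Python's standard library has no XQuery engine, so this uses the limited XPath subset supported by xml.etree.ElementTree as a stand-in; it is not XQuery itself. The document, its tag names and its entries are all hypothetical.

```python
import xml.etree.ElementTree as ET

# A hypothetical XML file treated as a small data source.
ENTRIES = """
<blog>
  <entry date="2007-03-28"><author>Robin</author><title>Draft notes</title></entry>
  <entry date="2007-04-01"><author>Robin</author><title>Semantics, Categories and Knowledge</title></entry>
  <entry date="2007-04-01"><author>Guest</author><title>A reply</title></entry>
</blog>
"""


def titles_by_author(xml_text, author):
    """Titles of entries whose <author> child has exactly this text."""
    root = ET.fromstring(xml_text)
    # [author='X'] selects entries that have a child <author> with text X.
    matches = root.findall(f"./entry[author='{author}']")
    return [entry.findtext("title") for entry in matches]


def titles_on_date(xml_text, date):
    """Titles of entries whose date attribute equals the given string."""
    root = ET.fromstring(xml_text)
    # [@date='X'] selects entries by attribute value.
    matches = root.findall(f"./entry[@date='{date}']")
    return [entry.findtext("title") for entry in matches]


if __name__ == "__main__":
    print(titles_by_author(ENTRIES, "Robin"))
    print(titles_on_date(ENTRIES, "2007-04-01"))
```

Note that anyone could write these queries against the file, whether or not they authored it, provided they know (or guess) the tag names the author chose.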
Above these primary layers in the pyramid is the next major layer, the RDF or "Resource Description Framework". W3C defines the RDF as metadata and uses the Dublin Core as its basis. RDF defines the metadata for a file using XML syntax and predicate logic. Predicate logic, here, is simply a form for an assertion: that a subject (a webpage or other URI) has a property (a name, a date, a place, etc.) whose value is an object ("Robin", "2007-3-28", "Lecce, Italy"). In other words, we can say that this Blog has an author who is Robin.
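The subject, property, object form can be sketched in a few lines of plain Python. This is an illustration of the shape of the assertions only, not RDF/XML and not any RDF library's API. The blog URI is a placeholder, and the mapping of the properties onto the Dublin Core names dc:creator, dc:date and dc:coverage is my own assumption for the example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One assertion: a subject has a property whose value is an object."""
    subject: str
    predicate: str
    obj: str


# Placeholder URI standing in for this blog (an assumption, not a real address).
BLOG = "http://example.org/blog"

# The assertions from the text, with Dublin Core-style property names
# written as plain strings.
TRIPLES = [
    Triple(BLOG, "dc:creator", "Robin"),
    Triple(BLOG, "dc:date", "2007-3-28"),
    Triple(BLOG, "dc:coverage", "Lecce, Italy"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


if __name__ == "__main__":
    # "Who is the author of this Blog?"
    for triple in match(TRIPLES, subject=BLOG, predicate="dc:creator"):
        print(triple.obj)
```

Querying is then a matter of leaving one position of the pattern open and seeing which assertions fill it.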
RDF allows for the association of a set of assertions, set out in predicate logic, about a resource. However, the use of the Dublin Core suggests that the class of statements found in an RDF is of a different order, a higher order. It is at this stage that we see what the Semantic Web is all about: the organisation of knowledge on the web through ever higher orders of [general] assertions.
Above the RDF is another layer, the OWL or "Web Ontology Language". The OWL is distinct from the RDF and always refers down to it, as OWL is a more restricted language that is intended to be more "machine readable". The justification for this further level is made clear in the W3C's "OWL Web Ontology Language: Overview":
"The Semantic Web is a vision for the future of the Web in which information is given explicit meaning, making it easier for machines to automatically process and integrate information available on the Web. The Semantic Web will build on XML's ability to define customized tagging schemes and RDF's flexible approach to representing data. The first level above RDF required for the Semantic Web is an ontology language [that] can formally describe the meaning of terminology used in Web documents. If machines are expected to perform useful reasoning tasks on these documents, the language must go beyond the basic semantics of RDF Schema."
OWL provides a language for defining Classes, the Members of Classes, the Properties of the Class and the Relationships between Classes. In other words, OWL is a language for defining higher order classifications between objects defined in the RDF. It is a language for writing ontologies, in the computer science sense: definitions of the higher order classes of objects and the logical relationships between them (see Ontology (computer science) at Wikipedia).
Perhaps an example might be helpful here, one drawn from W3Schools.
OWL Example (Airport)
OWL Resource: http://www.daml.org/2001/10/html/airport-ont
Class: Airport
  • elevation
  • iataCode
  • icaoCode
  • latitude
  • location
  • longitude
  • name
Here a document has defined a class called "Airport" which has certain properties (elevation, iataCode, longitude, latitude, etc.).
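To show what such a class definition amounts to as data, here is a rough Python sketch. It is not OWL syntax and it is not how an OWL reasoner works; it simply records the class name and the seven properties listed above, then compares a described resource against them. The OwlClass type, the helper functions and the sample description (including the runwayCount property) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OwlClass:
    """A class name plus the property names declared for it."""
    name: str
    properties: frozenset


# The Airport class, with the properties from the listing above.
AIRPORT = OwlClass(
    "Airport",
    frozenset({
        "elevation", "iataCode", "icaoCode", "latitude",
        "location", "longitude", "name",
    }),
)


def undeclared(owl_class, description):
    """Property names used in the description but not declared by the class."""
    return sorted(set(description) - owl_class.properties)


def missing(owl_class, description):
    """Declared property names that the description does not supply."""
    return sorted(owl_class.properties - set(description))


if __name__ == "__main__":
    # A made-up resource description; runwayCount is not part of the class.
    described = {"name": "Example Field", "iataCode": "XXX", "runwayCount": 2}
    print(undeclared(AIRPORT, described))
    print(missing(AIRPORT, described))
```

Seen this way, the class is a fixed list of names against which any local description is measured, which is exactly the move from local to shared vocabulary.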
You may think that this is just another layer of metadata, and you would be right, but this is not its intention. OWL is meant to be a higher-order classification of objects than RDF. Where RDF is fairly open as to what subjects it can assert properties of, OWL classes are seen as something much more universal, as essential classes such as persons, mammals, airports, continents, countries, etc. These are seen as "common metadata vocabularies", or fixed vocabularies and definitions: vocabularies that are shared and fixed either across communities of users or across the Web.
However, even if OWL classes are higher-order agreed classes that can handle subsumption and classification, there are many other kinds of logical processes that they cannot handle. To extend the logical and autonomous reach of the Semantic Web, W3C is developing RIF, the Rule Interchange Format. The intention of RIF is to allow rules to be translated between rule languages, so that different systems, based on different rule systems, can interoperate.

In the next entry in this blog, I will explore that amorphous collection of grassroots Web activity that Tim O'Reilly has called Web 2.0.