14 August 2007

Wikipedia, Freebase and the Semantic Web

There is a lot of discussion about how to organize information on the Web. For that matter, there is, and always has been, a lot of discussion about how to organize information generally. I have been on Freebase for about the last month and have found the differences between its approach and those of Wikipedia, at one extreme, and the Semantic Web, at the other, very enlightening. It is not that I think Freebase is the ultimate answer; I do not. However, I do think it offers a very interesting middle ground between those two extremes of information organization. It offers the ability to add as much information as possible, but makes only one requirement -- that each 'topic' has only one instance. In recognition of the Semantic Web, a series of high-level 'types' are being created, but, unlike the Semantic Web, anyone can extend these types and create new ones as they wish. This may not seem like much of a change, but it is, in fact, quite profound.

The key differences between the Semantic Web and Wikipedia are very telling. If we look first at the difference with the Semantic Web, we see that Freebase has abandoned its central tenet: that there is a universal logical structure to all knowledge, and that this structure can be defined (by a select few at the top of the W3C). The Types of Freebase may seem very similar to the high-level types of the Semantic Web, but they are much more like a hybrid of RDF and OWL. Rather than creating a pyramid of truth, as the Semantic Web is trying to do, Freebase's Types are a more traditional, and more pragmatic, categorisation of things. The high-level Types of Freebase do not, yet, claim to be higher-order concepts, but generalised conventions such as 'people', 'places', 'times', etc. I will not go into the discussion of why these Kantianesque categories are problematic, as it doesn't really matter here. We can happily use these categories within Freebase even though in most contexts they are problematic and uncertain. The key point is that the pragmatic categorisations -- Types -- of Freebase are infinitely extendible, where those of the Semantic Web are ultimately reducible. Freebase may seem very Semantic Webish, but it is not, as it inverts the logical structure of its categorisation. Whereas the Semantic Web starts from the messy diversity of the information world and, it hopes, progressively refines it to basic principles, Freebase starts from some pragmatic general categories and allows us all to extend them.

The differences with Wikipedia are even more interesting. Whereas Types are a kind of inversion of the Semantic Web's hierarchy, Freebase takes on Wikipedia's uncontrolled extension through its definition of 'topics'. By insisting that each "thing" in the world has only one instance -- one topic entry -- Freebase hopes to overcome the multiple accounts that proliferate on Wikipedia. The hope is that it will be the categorisations that proliferate, not the instances.

This is not a bad idea, though it is fraught with its own problems. Robert Cook and I had a few discussions about this problem here and on Freebase, though I don't think I expressed my concerns very well. Perhaps I can clarify a bit here.

What is very interesting about Wikipedia, and all Wikis, is that when a topic is begun it takes a bit of time to stabilise. The process of stabilisation usually occurs as a certain group of wiki-editors appropriates the topic and keeps others from complicating their version. As a result, others, who may disagree with the now 'authoritative' account, create other entries with different accounts. We might call this process 'budding'. Other accounts 'bud' off of the original stable account to create a constellation of accounts around any topic. It is this budding that Freebase is attempting to avoid.

As I have stated below, I do not think that this is a problem, as I see it as a sensible pragmatic decision. Not least because this problem -- how to link up all the different accounts which surround a topic -- is one of the most difficult in the history of philosophy. Thomas Kuhn, for one, demonstrated in the 1960s that this kind of budding around a stable topic is the key mechanism of paradigm shifts in science. Others have argued since that it is a key mechanism in all knowledge production. As such, to legislate against this budding around topics could have serious implications for the future of Freebase.

The problem, though, is that going down Wikipedia's route won't work either: there is no way of accounting for the discursive connections between the stable topic and the buds. By keeping one, and only one, instance of each topic, Freebase overcomes the problem of multiple instances, but at the cost of denying any mechanism for accounting for the new and diverse opinions that create paradigm shifts in knowledge. What I am arguing here is not very different from what Marvin Minsky proposed in his Society of Mind theory -- or, for that matter, Danny Hillis, the founder of Metaweb.

I am afraid that I, too, have no real solution to this problem, but I do ask the people at Metaweb not to ignore it by claiming that the single-instance topic is philosophically real.

02 August 2007

Freebase revisited

I was very pleased to see that Robert Cook responded to my comments below, and felt that he was right on one point, that I needed to clarify my final point. I agree that I did finish off a bit abruptly.
It is not that I think that Freebase will fail. In fact, I think quite the opposite. I do think, though, that my point is germane not only to Freebase, but to such knowledge accumulations generally. The problem is that we think we know things by knowing how to name them -- and knowing what the name means. We are, in the West, constantly taught that this is how we know. We are bombarded with training manuals that classify the subject and explain to us this classification. We are constantly exposed to a media that is classifying and naming the events around us. We are constantly trying to understand what is going on around us by finding the appropriate names and the appropriate meanings for those names.
We are told that we are this sort of an employee; that we are that sort of a resident; that we are this class of a tax payer; that we are male or female or gay; that we have this type of body; or, hopefully not, this kind of disease with these characteristics. We are classified, ordered and named constantly. We are told that the advances of science and medicine and society are this or that sort of thing. We are told that the problems of the world are due to this type of person, or this type of belief or, worse, this type of religion.
We are also told that types of things have definitive characteristics. When these definitive characteristics are correct, the thing is right, when they are incorrect, the thing is wrong. We see this in the gay debate, or the debate about terrorism. It is not that different people have different characteristics, or that they interpret their characteristics differently -- or even that the social context of these interpretations is very complex -- but that there are bad characteristics, ones that do not fit the norm. But, of course, what is the norm, and how is it defined? I'm not going to go into that, as there is a huge literature on this subject. I would just point you to, if you are interested, the work of Michel Foucault and the hundreds of works about the social construction of the norm.
We could also ask a question that is more pertinent to the Freebase discussion: what is a thing anyway? Is it a unique entity that simply has names and characteristics defined onto it, or is it something more fluid, dynamic and constructed? Now, I am not an idealist: I do not believe that everything in the world happens in my head, and that there is no reality outside of my mind. But there is a big difference between the physical object and what we say that physical object means.
Naming and classifying an object is certainly a kind of meaning, it would be absurd to say it wasn't. However, it is but a 'kind' of meaning, if I can use classification to explain classification. We use classifications because they are useful, very useful, but usefulness implies use. We use classifications, or whatever method, to understand things because the actions performed, social and practical, allow us, with others, to construct an account of the world that supports other meaningful actions. In this sense, we do not have understanding, as we have a car or a house, but understanding is something we do. It is a skilled activity.
We could say the same of things. We do not know things because they have innate characteristics, that we know more or less well, but because we are able to perform particular meaningful actions with them. Classification is one such meaningful activity that we do with things.
In this way, Freebase offers a very useful approach to the accumulation of accounts of things. Not because it is realistically or definitively defining the world of things, but because it will allow us all, through a dynamic classification, to define our many different domains of understanding. More importantly, Freebase offers the possibility to ensure that these different domains are communally defined and maintained. It should ensure that these various orders of the world, the various domains, are the emergent result of communities of knowledge, not the singular assertions of a single community.
In my next post, I plan to discuss why Freebase is a much better approach to this problem of knowledge order than the simple Wiki.

19 July 2007

Freebase: Ideology vs. Practice

I have been on the alpha version of Freebase for about a week now and I'm very impressed. It is an interesting experiment in how to find a reasonable median between the vast openness of Wikis and the narrow-mindedness of the Semantic Web. With its user-expandable types and properties, it looks to be a very exciting development. The community-definable domains will prove even more exciting, I believe, as the folks at Metaweb realise just how powerful these are for different domains of expertise and knowledge.

I was somewhat dismayed, therefore, when I read Robert Cook's latest entry in his Freebasics blog. The entry is a comparison between Freebase and Google Base, most of which I agree with. However, he goes on to say that Google Base has many different records for each object where ...

"Metaweb, by contrast, has just a single record for the Canon EOS 20D with redundancy and discrepancies resolved. Metaweb contains only ‘reconciled’ data, and maps a single object to a single thing in the world."

and that ...

"This idea of reconciliation is core to the idea of a Metaweb Topic. From my earlier posting:

  1. A topic represents a person, place, thing or idea.
  2. No two topics should have the same meaning.
  3. A topic should be important enough that a group of sane people would have something to say about it."

The fact that a topic on Freebase represents a person, place, thing or idea is just fine, as is the point that a topic should be important to a significant group of people. I am not sure why he defines them as being necessarily sane, as it is my experience that different groups of people have different interests, often vastly different, sane or otherwise.

The major problem I think arises from point number 2. Cook goes on to underline this point ...

"All distinguish Metaweb from other online data sources, but the second one is the most important. A key value of Metaweb is to squeeze out redundancy so that people (and machines) have definitive information."

But what is "definitive information"? Knowledge is not a definitive set of attributes or properties, but a rich history and contemporary discussion about the object. From this emerge attributes and properties, but these are constantly under dispute. Could we imagine a scientific discipline where only one account of any process or object was allowed? It would be disastrous. Could we imagine an industrial process where any one topic could have only one meaning? Culture, industry and science as we know them would cease to be.

Knowledge is promoted, grows, evolves and develops through disagreement, challenge and critique. It is just those unresolved differences which make knowledge possible. Remove them, and you remove the possibility of knowledge. Try to remove them from Freebase, or to severely restrict them, and Freebase will fail.

It is not needed, so why have such a requirement?


01 April 2007

Semantics, Categories and Knowledge

This is the first of a many-part series of posts about what knowledge is on the Web. I will start from the several, conflicting, claims about what knowledge is and how it works, or should work, on-line, and will continue by explaining what I think knowledge is and why it already works on-line.
Why should I worry about such esoteric themes? My concern is that there are a lot of claims being made these days about what knowledge is and how we can automate it on the Web. This includes the many discussions about the Semantic Web and Web 2.0, as well as the endless whine, from some quarters, that the Web is a confused mess and, therefore, urgently needs organising.
My argument, which will take a few instalments to make, is:
  • That what makes the Web a powerful knowledge source is that it is self-organising.
  • That knowledge, and hence understanding and competence, is not a matter of knowing the categories that make up a subject and the rules that relate these categories, but is a process of engaged work and interaction.
  • That what makes something meaningful, and hence knowledge, is a process, not a thing. It is therefore something you do rather than something you get.
  • That knowledge making, and knowledge representation, is necessarily a messy business that requires a diversity of approaches.
  • That translation between logics and knowledges, like languages, is not a straightforward rule-based process, but is partial, interpretive and, at times, impossible.
  • Finally, that despite the fact that the Web is a network of computers which are calculating machines, machines that require categories and rules (algorithms) to work, the Web is, nevertheless, an excellent system for sharing and creating knowledge.

Knowledge as organised words and concepts: The Semantic Web:
I want to start by giving my description of The Semantic Web. I know that this seems to be a waste of time as there are a huge number of descriptions and explanations of The Semantic Web, but my description has a slightly different purpose than these many others. My purpose is to examine The Semantic Web as a claim to organise knowledge.
Just to get it out of the way, I should start by stating that I am not opposed to XML or RDF, or even necessarily to ontologies (as used in computing). I regularly use XML and am beginning to use RDF as a means for querying resources. They are both powerful tools and immensely useful. However, as I hope will become clear, they are not straightforward systems for representing knowledge.

The Semantic Web, as defined by the W3C, is not just a series of mark-up languages, but is a multi-tiered model of knowledge (see The Semantic Web Revisited). This vision starts from what it sees as the most basic Web-resource, the URI, and moves up through a series of 'syntaxes' that situate the 'meaning' of the Web-resource in ever more universal categories.
At the base of this pyramid of meaning is what is called a "surface syntax": XML. XML, as many of you already know, is a convention, as defined by the W3C, for marking up components of a Web-resource with names (tags). The XML convention does not place any requirements on what these tags should be, just on the syntax of the mark-up. This is what gives XML its wonderful flexibility to represent just about anything in any form and in any language. It is completely local. It allows the author, or authors, to describe the resource using any form of tags, or names, they wish (see W3Schools XML introduction).
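This locality can be made concrete with a small sketch. Below, using Python's standard xml.etree library, two hypothetical authors mark up the same fact with entirely different, self-chosen vocabularies; both documents are perfectly valid XML (the tag names and values are invented for illustration):

```python
import xml.etree.ElementTree as ET

# Two authors describe the same book with self-chosen vocabularies.
# XML constrains the syntax of the mark-up, not the choice of tag names.
doc_a = ET.fromstring("<book><title>Ulysses</title><writer>Joyce</writer></book>")
doc_b = ET.fromstring("<opera><titolo>Ulysses</titolo><autore>Joyce</autore></opera>")

# Each document is well-formed and queryable in its own local terms.
print(doc_a.find("title").text)   # Ulysses
print(doc_b.find("titolo").text)  # Ulysses
```

Nothing in XML itself relates the first author's 'title' to the second's 'titolo'; that is exactly the "no agreed terminology" gap the higher layers of the pyramid set out to fill.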
Above the XML layer are the XML Schema and Query languages. These are equally useful as they allow the author(s) to describe what they see as restrictions on the structure and content of the XML document. Basically, the XML Schema file is somewhat like a data definition for a database, though it does a bit more (see W3Schools XML Schema). XML Query, or XQuery, is a developing language for writing queries to XML files, preferably with XSD (XML Schema) definitions (see W3C XQuery overview). Both of these conventions extend the open mark-up of XML so that XML files can be queried as data sources. These conventions are also very flexible and extensible, allowing authors to define their data as they wish, and extend these definitions to other XML resources.
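As a rough illustration of treating an XML file as a queryable data source, here is a sketch using Python's standard xml.etree library; the path expressions stand in, very loosely, for what XQuery does, and the catalogue content is invented:

```python
import xml.etree.ElementTree as ET

# An XML file acting as a small data source.
catalogue = ET.fromstring("""
<catalogue>
  <book year="1922"><title>Ulysses</title></book>
  <book year="1925"><title>Mrs Dalloway</title></book>
</catalogue>
""")

# An XQuery-like question: the titles of all books published after 1923.
titles = [b.find("title").text
          for b in catalogue.findall("book")
          if int(b.get("year")) > 1923]
print(titles)  # ['Mrs Dalloway']
```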
Up to this point, it all seems quite sensible. We have a much more flexible and extensible mark-up language, which can define information almost any way that the author wants. We have a Schema and a Query language that allow for open and extensible definitions of how the author thinks people, or machines, should access and search this resource. Because of the structure of these three documents, we can even see other people writing XML Schemas and XQueries for any XML file, whether they authored it or not. This makes for a very open-ended Web.
This may accommodate the needs for flexibility locally, at the file-face, so to speak, however, we are told that "there is no structure", "no agreed terminology", "no order", and hence "no knowledge" at this level. Something must be done, and W3C has done it.
Above these primary layers in the pyramid is the next major layer, RDF or the "Resource Description Framework". The W3C defines RDF as metadata and uses the Dublin Core as its basis. RDF defines the metadata for a file using XML syntax and predicate logic. Predicate logic is simply a form for an assertion: that a subject (a webpage or other URI) has a property (a name, a date, a place, etc.) which is an object ("Robin", "2007-03-28", "Lecce, Italy"). In other words, we can say that this Blog has an author who is Robin.
RDF allows for the association of a set of assertions, set out in predicate logic, about a resource. However, the use of the Dublin Core suggests that the class of statements found in an RDF is of a different order -- a higher order. It is at this stage that we see what the Semantic Web is all about: the organisation of knowledge on the Web through ever higher orders of [general] assertions.
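The subject-property-object form of an RDF assertion can be sketched without any RDF tooling at all, simply as Python tuples (the URI and the Dublin Core-style property names below are illustrative, not real resources):

```python
# RDF reduces metadata to assertions of the form (subject, property, object).
triples = {
    ("http://example.org/blog", "dc:creator",  "Robin"),
    ("http://example.org/blog", "dc:date",     "2007-03-28"),
    ("http://example.org/blog", "dc:coverage", "Lecce, Italy"),
}

# Asking "who is the author of this blog?" is a pattern match over the set.
def objects_of(subject, prop):
    return [o for (s, p, o) in triples if s == subject and p == prop]

print(objects_of("http://example.org/blog", "dc:creator"))  # ['Robin']
```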
Above the RDF is another layer, the OWL or "Web Ontology Language". The OWL is distinct from the RDF and always refers down to it as OWL is a more restricted language that is intended to be more "machine readable". The justification for this further level is made clear by the W3C's "OWL Web Ontology Language: Overview".
"The Semantic Web is a vision for the future of the Web in which information is given explicit meaning, making it easier for machines to automatically process and integrate information available on the Web. The Semantic Web will build on XML's ability to define customized tagging schemes and RDF's flexible approach to representing data. The first level above RDF required for the Semantic Web is an ontology language that can formally describe the meaning of terminology used in Web documents. If machines are expected to perform useful reasoning tasks on these documents, the language must go beyond the basic semantics of RDF Schema."
OWL provides a language for defining Classes, the Members of Classes, the Properties of the Class and the Relationships between Classes. In other words, OWL is a language for defining higher-order classifications between objects defined in the RDF. OWL is a type of ontology, as used in computer science, that defines the higher-order classes and logical relationships between objects. (see Ontology (computer science) at Wikipedia)
Perhaps an example might be helpful here, one drawn from W3Schools.
OWL Example (Airport)
OWL Resource: http://www.daml.org/2001/10/html/airport-ont
Class: Airport
Properties:
  • elevation
  • iataCode
  • icaoCode
  • latitude
  • location
  • longitude
  • name
Here a document has defined a class called "airport" which has certain properties (elevation, iataCode, longitude, latitude, etc.).
You may think that this is just another layer of metadata, and you would be right, but this is not its intention. OWL is meant to be a higher-order classification of objects than RDF. Where RDF is fairly open as to what subjects it can assert properties of, OWL classes are seen as something much more universal -- as essential classes such as persons, mammals, airports, continents, countries, etc. These are seen as "common metadata vocabularies", or fixed vocabularies and definitions -- vocabularies that are shared and fixed either across communities of users or across the Web.
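What OWL adds over RDF -- classes, class membership, and relations such as subsumption between classes -- can be caricatured in a few lines of Python. This is only a toy under invented names; real OWL reasoners handle a far richer logic:

```python
# A toy class hierarchy: each class names its direct superclass.
superclass = {"Airport": "Place", "City": "Place", "Capital": "City"}

def subsumed_by(cls, ancestor):
    """True if `ancestor` appears at or above `cls` in the hierarchy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = superclass.get(cls)
    return False

# Class membership plus subsumption yields simple inference:
# Heathrow is an Airport, and every Airport is a Place.
instance_of = {"Heathrow": "Airport"}
print(subsumed_by(instance_of["Heathrow"], "Place"))  # True
```

The point of fixing such classes across the Web is that any machine sharing the vocabulary could draw the same inference.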
However, even if OWL provides higher-order agreed classes that can handle subsumption and classification, there are many other kinds of logical process that it cannot handle. To extend the logical and autonomous reach of the Semantic Web, the W3C is developing RIF, the Rule Interchange Format. The intention of RIF is to allow rules to be translated between rule languages, allowing different systems, based on different rule systems, to interoperate.
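The sort of rule that RIF is meant to carry between systems can be sketched as a single forward-chaining step over a set of triples (all the names here are invented):

```python
# A small fact base of (subject, property, object) triples.
triples = {("Heathrow", "locatedIn", "London"),
           ("London",   "locatedIn", "England")}

# A toy rule of the kind RIF would interchange:
# if ?x locatedIn ?y and ?y locatedIn ?z, then ?x locatedIn ?z.
def apply_transitivity(facts):
    derived = {(x, "locatedIn", z)
               for (x, p1, y1) in facts if p1 == "locatedIn"
               for (y2, p2, z) in facts if p2 == "locatedIn" and y1 == y2}
    return facts | derived

closure = apply_transitivity(triples)
print(("Heathrow", "locatedIn", "England") in closure)  # True
```

RIF's ambition is that a rule like this, written for one rule engine, could be exported and re-run unchanged on another.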

In the next entry to this blog, I will explore that amorphous collection of grassroots Web practices that Tim O'Reilly has called Web 2.0.

18 February 2007

What is RESCITE?

RESCITE is a personal blog where I explore my thoughts about Web 2.0, information, and how it moves and is modified in different contexts. This sounds a bit pretentious, as does the name, but I do not mean it to be. I chose the title RESCITE because of the many entangled meanings of the words 'recite' and 'resite', both of which imply accounts and the movement of knowledge. The strange spelling is not meant to be clever; 'recite' was simply already taken as a blog name.
I welcome comments, as the whole point of the exercise is that I do not, and will not, have an answer, but that this is a set of my ideas and impressions on a general problem. The problem is that we usually assume that we know things by knowing the meaning of words, and that these words have rules for how they are used. Knowing the rules is knowing how or knowing what. However, we all use words differently, at different times and in different settings. We know what we mean, and others know what we mean, because the words make sense in a particular setting. They are used correctly, even if they are not used according to the rules.
Rather than dictating how we use words, ideas, information, knowledge, I argue that we need to find ways where people can use these objects to make sense. That means allowing them to "mis"-use them. It is the many ways that these objects -- accounts, stories, words, information, images, etc. -- can move around and be re-used, made sense of locally, and made use of to communicate with others that I am interested in.